Abstract

Motivated by the problem of effectively executing clustering algo- 
rithms on very large data sets, we address a model for large scale dis- 
tributed clustering methods. To this end, we briefly recall some stan- 
dards on the quantization problem and some results on the almost sure 
convergence of the Competitive Learning Vector Quantization (CLVQ) 
procedure. A general model for linear distributed asynchronous algo- 
rithms well adapted to several parallel computing architectures is also 
discussed. Our approach brings together this scalable model and the 
CLVQ algorithm, and we call the resulting technique the Distributed 
Asynchronous Learning Vector Quantization algorithm (DALVQ). An 
in-depth analysis of the almost sure convergence of the DALVQ al- 
gorithm is performed. A striking result is that we prove that the 
multiple versions of the quantizers distributed among the processors 
in the parallel architecture asymptotically reach a consensus almost 
surely. Furthermore, we also show that these versions converge almost 
surely towards the same nearly optimal value for the quantization cri- 
terion. 



Keywords: k-means, vector quantization, distributed, asynchronous,
stochastic optimization, scalability, distributed consensus.






1 Introduction 



Distributed algorithms arise in a wide range of applications, including telecom- 
munications, distributed information processing, scientific computing, real 
time process control and many others. Parallelization is one of the most 
promising ways to harness greater computing resources, whereas building 
faster serial computers is increasingly expensive and also faces some physical 
limits such as transmission speeds and miniaturization. One of the challenges 
proposed for Machine Learning is to build scalable applications that quickly 
process large amounts of data in sophisticated ways. Building such large 
scale algorithms attacks several problems in a distributed framework, such 
as communication delays in the network or numerous problems caused by 
the lack of shared memory. 

Clustering algorithms are one of the primary tools of unsupervised learning. 
From a practical perspective, clustering plays an outstanding role in data 
mining applications such as text mining, web analysis, marketing, medical 
diagnostics, computational biology and many others. Clustering is a sepa- 
ration of data into groups of similar objects. As clustering represents the 
data with fewer clusters, there is a necessary loss of certain fine details, but 
simplification is achieved. The popular Competitive Learning Vector Quan- 
tization (CLVQ) algorithm (see Gersho and Gray [22]) provides a technique 
for building reliable clusters characterized by their prototypes. As pointed 
out by Bottou in [11], the CLVQ algorithm can also be viewed as the on-line 
version of the widespread Lloyd's method (see Lloyd [29] for the definition),
which is referred to as batch k-means in [11]. The CLVQ also belongs to
the class of stochastic gradient descent algorithms (for more information on
stochastic gradient descent procedures we refer the reader to Benveniste et
al.).

The analysis of parallel stochastic gradient procedures in a Machine Learning
context has recently received a great deal of attention (see for instance
Zinkevich et al. [12] and Mac Donald et al.). In the present paper, we go further
by introducing a model that brings together the original CLVQ algorithm and
the comprehensive theory of asynchronous parallel linear algorithms developed
by Tsitsiklis [41], Tsitsiklis et al. and Bertsekas and Tsitsiklis [6].
The resulting model will be called Distributed Asynchronous Learning Vector
Quantization (DALVQ for short). At a high level, the DALVQ algorithm
parallelizes several executions of the CLVQ method concurrently on different
processors while the results of these algorithms are broadcast through
the distributed framework asynchronously and efficiently. Here, the term
processor refers to any computing instance in a distributed architecture (see
Bullo et al. [13, Chapter 1] for more details). Let us remark that there is
a series of publications similar in spirit to this paper. Indeed, in Frasca et
al. [21] and in Durham et al. [17], a coverage control problem is formulated
as an optimization problem where the cost functional to be minimized is the
same as that of the quantization problem stated in this manuscript.

Let us provide a brief mathematical introduction to the CLVQ technique and
DALVQ algorithms. The first technique computes a quantization scheme for
$d$-dimensional samples $z_1, z_2, \ldots$ using the following iterations on a
$(\mathbb{R}^d)^{\kappa}$ vector,

\[ w(t+1) = w(t) - \varepsilon_{t+1} H(z_{t+1}, w(t)), \quad t \geq 0. \]

In the equation above, $w(0) \in (\mathbb{R}^d)^{\kappa}$ and the $\varepsilon_t$ are positive reals. The vector
$H(z, w)$ is the opposite of the difference between the sample $z$ and its nearest
component in $w$. Assume that there are $M$ computing entities and that the data are
split among the memories of these machines: $z^i_1, z^i_2, \ldots$, where $i \in \{1, \ldots, M\}$.
The DALVQ algorithms are then defined by the $M$ iterations $\{w^i(t)\}_{t=0}^{\infty}$,
called versions, satisfying (with slight simplifications)

\[ w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t)\, w^j(\tau^{i,j}(t)) - \varepsilon^i_{t+1} H(z^i_{t+1}, w^i(t)), \tag{1.1} \]

$i \in \{1, \ldots, M\}$ and $t \geq 0$. The time instants $\tau^{i,j}(t) \geq 0$ are deterministic
but unknown and the delays satisfy $t - \tau^{i,j}(t) \geq 0$. The families $\{a^{i,j}(t)\}_{j=1}^{M}$
define the weights of convex combinations.

As a striking result, we prove that multiple versions of the quantizers,
distributed among the processors in a parallel architecture, asymptotically reach
a consensus almost surely. With the notation introduced above, this reads

\[ w^i(t) - w^j(t) \xrightarrow[t \to \infty]{} 0, \quad (i, j) \in \{1, \ldots, M\}^2, \quad \text{almost surely (a.s.).} \]

Furthermore, we also show that these versions converge almost surely towards
(the same) nearly optimal value for the quantization criterion. These
convergence results are similar in spirit to the most satisfactory almost sure
convergence theorem for the CLVQ algorithm obtained by Pages in [33].

For a given time span, our parallel DALVQ algorithm is able to process much 
more data than a single processor execution of the CLVQ procedure. Moreover,
DALVQ is also asynchronous. This means that local algorithms do not
have to wait at preset points for messages to become available. This allows
some processors to compute faster and execute more iterations than others, 
and it also allows communication delays to be substantial and unpredictable. 
The communication channels are also allowed to deliver messages out of or- 
der, that is, in a different order than the one in which they were transmitted. 
Asynchronism can provide two major advantages. First, a reduction of the 
synchronization penalty, which could bring a speed advantage over a syn- 
chronous execution. Second, for potential industrialization, asynchronism 
has greater implementation flexibility. Tolerance to system failures and
uncertainty can also be increased. As is the case with any on-line algorithm,
DALVQ also deals with variable data loads over time. In fact, on-line
algorithms avoid tremendous and non scalable batch requests on all data sets.
Moreover, with an on-line algorithm, new data may enter the system and be 
taken into account while the algorithm is already running. 

The paper is organized as follows. In Section 2 we review some standard
facts on the clustering problem. We extract the relevant material from Pages
[33] without proof, thus making our exposition self-contained. In Section 3
we give a brief exposition of the mathematical framework for parallel
asynchronous gradient methods introduced by Tsitsiklis et al. and Bertsekas
and Tsitsiklis [6]. The results of Blondel et al. [8] on the asymptotic
consensus in asynchronous parallel averaging problems are also recalled. In
Section 4 our main results are stated and proved.



2 Quantization and CLVQ algorithm 
2.1 Overview 

Let $\mu$ be a probability measure on $\mathbb{R}^d$ with finite second-order moment. The
quantization problem consists in finding a "good approximation" of $\mu$ by
a set of $\kappa$ vectors of $\mathbb{R}^d$ called a quantizer. Throughout the document the
$\kappa$ quantization points (or prototypes) will be seen as the components of a
$(\mathbb{R}^d)^{\kappa}$-dimensional vector $w = (w_1, \ldots, w_{\kappa})$. To measure the correctness
of a quantization scheme given by $w$, one introduces a cost function called
distortion, defined by

\[ C_{\mu}(w) = \frac{1}{2} \int_{\mathbb{R}^d} \min_{1 \leq \ell \leq \kappa} \|z - w_{\ell}\|^2 \, d\mu(z). \]

Under some minimal assumptions, the existence of an optimal quantizer vector
$w^{\circ} \in \operatorname{argmin}_{w \in (\mathbb{R}^d)^{\kappa}} C_{\mu}(w)$ has been established by Pollard in [35] (see
also Sabin and Gray [38, Appendix 2]).

In a statistical context, the distribution $\mu$ is only known through $n$ independent
random observations $z_1, \ldots, z_n$ drawn according to $\mu$. Denote by $\mu_n$ the
empirical distribution based on $z_1, \ldots, z_n$, that is, for every Borel subset $A$
of $\mathbb{R}^d$,

\[ \mu_n(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{z_i \in A\}}. \]

Much attention has been devoted to the convergence study of the quantization
scheme provided by the empirical minimizers

\[ w_n^{\circ} \in \operatorname{argmin}_{w \in (\mathbb{R}^d)^{\kappa}} C_{\mu_n}(w). \]

The almost sure convergence of $C_{\mu}(w_n^{\circ})$ towards $\min_{w \in (\mathbb{R}^d)^{\kappa}} C_{\mu}(w)$ was proved
by Pollard in [35] and Abaya and Wise in [2]. Rates of convergence and
nonasymptotic performance bounds have been considered by Pollard,
Chou, Linder et al., Bartlett et al., Linder, Antos [1] and
Antos et al. [3]. Convergence results have been established by Biau et al. in
[7] where $\mu$ is a measure on a Hilbert space. It turns out that the minimization
of the empirical distortion is a computationally hard problem. As shown
by Inaba et al. in [21], the computational complexity of this minimization
problem is exponential in the number of quantizers $\kappa$ and the dimension of
the data $d$. Therefore, exact computations are intractable for most
practical applications.
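
To make the empirical distortion concrete, here is a minimal Python sketch that evaluates $C_{\mu_n}(w)$ on a finite sample. The function names `squared_distance` and `empirical_distortion` and the four-point data set are illustrative assumptions, not part of the original exposition.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of R^d."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def empirical_distortion(samples: Sequence[Point], w: Sequence[Point]) -> float:
    """Return (1 / (2 n)) * sum_i min_l ||z_i - w_l||^2 for the prototypes w."""
    total = sum(min(squared_distance(z, w_l) for w_l in w) for z in samples)
    return 0.5 * total / len(samples)


if __name__ == "__main__":
    # Toy data (hypothetical): the four corners of a 4 x 2 rectangle.
    data = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
    print(empirical_distortion(data, [(2.0, 1.0)]))              # one prototype
    print(empirical_distortion(data, [(0.0, 1.0), (4.0, 1.0)]))  # two prototypes
```

On this toy sample the distortion drops from 2.5 with a single central prototype to 0.5 with two prototypes. Evaluating the criterion for a given $w$ is cheap; it is the search over all $w$ that is hard.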

Based on this, our goal in this document is to investigate effective methods
that produce accurate quantizations from data samples. One of the most
popular procedures is Lloyd's algorithm (see Lloyd [29]), sometimes referred
to as batch k-means. A convergence theorem for this algorithm is provided
by Sabin and Gray in [38]. Another celebrated quantization algorithm is the
Competitive Learning Vector Quantization (CLVQ), also called on-line
k-means. The latter name emphasizes the fact that data arrive over time during
the execution of the algorithm and that their characteristics are unknown until
their arrival times. The main difference between the CLVQ and Lloyd's
algorithm is that the latter runs in batch training mode. This means that
the whole training set is presented before performing an update, whereas the
CLVQ algorithm uses a single item of the training sequence at each update.
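
The batch training mode can be sketched as follows. This is a minimal illustration of one Lloyd iteration under simplifying assumptions: the names `nearest_index` and `lloyd_step` are hypothetical, ties are broken towards the smallest index, and a prototype with an empty cell is left where it is (a convention of this sketch only).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest_index(z: Sequence[float], w: Sequence[Sequence[float]]) -> int:
    """Index of the prototype closest to z (ties go to the smallest index)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, w_l)) for w_l in w]
    return dists.index(min(dists))


def lloyd_step(samples: Sequence[Point], w: Sequence[Point]) -> List[Point]:
    """One batch update: assign all samples, then move each prototype to its cell mean."""
    d = len(w[0])
    sums = [[0.0] * d for _ in w]
    counts = [0] * len(w)
    for z in samples:  # the whole training set is read before any update
        l = nearest_index(z, w)
        counts[l] += 1
        for c in range(d):
            sums[l][c] += z[c]
    new_w: List[Point] = []
    for l, w_l in enumerate(w):
        if counts[l] == 0:
            new_w.append(tuple(w_l))  # empty cell: keep the prototype (sketch convention)
        else:
            new_w.append(tuple(s / counts[l] for s in sums[l]))
    return new_w


if __name__ == "__main__":
    data = [(0.0,), (2.0,), (10.0,), (12.0,)]
    print(lloyd_step(data, [(1.5,), (9.0,)]))
```

Each call reads every sample once before any prototype moves, which is precisely what an on-line procedure avoids.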

The CLVQ procedure can be seen as a stochastic gradient descent algorithm.
In the more general context of gradient descent methods, one cannot hope
for the convergence of the procedure towards global minimizers with a non
convex objective function (see for instance Benveniste et al.). In our quantization
context, the distortion mapping $C_{\mu}$ is not convex (see for instance
Graf and Luschgy [23]). Thus, just as in Lloyd's method, the iterations provided
by the CLVQ algorithm converge towards local minima of $C_{\mu}$.

Assuming that the distribution $\mu$ has a compact support and a bounded
density with respect to the Lebesgue measure, Pages states in [33] a result
regarding the almost sure consistency of the CLVQ algorithm towards critical
points of the distortion $C_{\mu}$. The author shows that the set of critical points
necessarily contains the global and local optimal quantizers. The main difficulties
in the proof arise from the fact that the gradient of the distortion is
singular on $\kappa$-tuples having equal components and the distortion function
is not convex. This explains why standard theories for stochastic gradient
algorithms do not apply in this context.



2.2 The quantization problem, basic properties 

In the sequel, we denote by $\mathcal{G}$ the closed convex hull of $\operatorname{supp}(\mu)$, where
$\operatorname{supp}(\mu)$ stands for the support of the distribution. Observe that, with this
notation, the distortion mapping is the function $C : (\mathbb{R}^d)^{\kappa} \to [0, \infty)$ defined
by

\[ C(w) = \frac{1}{2} \int_{\mathcal{G}} \min_{1 \leq \ell \leq \kappa} \|z - w_{\ell}\|^2 \, d\mu(z), \quad w = (w_1, \ldots, w_{\kappa}) \in (\mathbb{R}^d)^{\kappa}. \tag{2.1} \]

Throughout the document, with a slight abuse of notation, $\|\cdot\|$ means both
the Euclidean norm of $\mathbb{R}^d$ and that of $(\mathbb{R}^d)^{\kappa}$. In addition, the notation $\mathcal{D}_*^{\kappa}$ stands for
the set of all vectors of $(\mathbb{R}^d)^{\kappa}$ with pairwise distinct components, that is,

\[ \mathcal{D}_*^{\kappa} = \{ w \in (\mathbb{R}^d)^{\kappa} \mid w_{\ell} \neq w_k \text{ if and only if } \ell \neq k \}. \]

Under some extra assumptions on $\mu$, the distortion function can be rewritten
using a partition of the space called the Voronoi tessellation.

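Definition 2.1 Let $w \in (\mathbb{R}^d)^{\kappa}$, the Voronoi tessellation of $\mathcal{G}$ related to $w$ is
the family of open sets $\{W_{\ell}(w)\}_{1 \leq \ell \leq \kappa}$ defined as follows:

• If $w \in \mathcal{D}_*^{\kappa}$, for all $1 \leq \ell \leq \kappa$,

\[ W_{\ell}(w) = \{ v \in \mathcal{G} \mid \|w_{\ell} - v\| < \min_{k \neq \ell} \|w_k - v\| \}. \]

• If $w \in (\mathbb{R}^d)^{\kappa} \setminus \mathcal{D}_*^{\kappa}$, for all $1 \leq \ell \leq \kappa$,

  - if $\ell = \min \{k \mid w_k = w_{\ell}\}$,

\[ W_{\ell}(w) = \{ v \in \mathcal{G} \mid \|w_{\ell} - v\| < \min_{w_k \neq w_{\ell}} \|w_k - v\| \}, \]

  - otherwise, $W_{\ell}(w) = \emptyset$.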
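
The two cases of Definition 2.1 can be illustrated with a short Python sketch. The function `voronoi_cell` is a hypothetical helper: it assumes that the point $v$ already belongs to $\mathcal{G}$, and it detects ties with exact floating-point equality, which is adequate for an illustration only.

```python
from typing import Optional, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of R^d."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def voronoi_cell(v: Sequence[float], w: Sequence[Sequence[float]]) -> Optional[int]:
    """Index l such that v lies in the open cell W_l(w), or None on a cell border."""
    dists = [squared_distance(v, w_l) for w_l in w]
    best = min(dists)
    winners = [l for l, d_l in enumerate(dists) if d_l == best]
    first = winners[0]  # smallest index among the closest prototypes
    # If two *distinct* prototypes are equally close, v is on a border and
    # belongs to no open cell. Copies of the same prototype share one cell,
    # attributed to the smallest index; the other copies have empty cells.
    if any(tuple(w[l]) != tuple(w[first]) for l in winners):
        return None
    return first


if __name__ == "__main__":
    print(voronoi_cell((0.5, 0.0), [(0.0, 0.0), (2.0, 0.0)]))              # inside a cell
    print(voronoi_cell((1.0, 0.0), [(0.0, 0.0), (2.0, 0.0)]))              # on the border
    print(voronoi_cell((0.5, 0.0), [(0.0, 0.0), (0.0, 0.0), (2.0, 0.0)]))  # repeated prototype
```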

As an illustration, Figure 1 shows the Voronoi tessellation associated to a vector
$w \in ([0, 1] \times [0, 1])^{50}$ whose components have been drawn independently
and uniformly. This figure also highlights a remarkable property of the cell
borders, which are portions of hyperplanes (see Graf and Luschgy [23]).

Figure 1: Voronoi tessellation of 50 points of $\mathbb{R}^2$ drawn uniformly in a square.

Observe that if $\mu(H)$ is zero for any hyperplane $H$ of $\mathbb{R}^d$ (a property which
is sometimes referred to as strong continuity) then using Definition 2.1 it is
easy to see that the distortion takes the form:

\[ C(w) = \frac{1}{2} \sum_{\ell=1}^{\kappa} \int_{W_{\ell}(w)} \|z - w_{\ell}\|^2 \, d\mu(z), \quad w \in (\mathbb{R}^d)^{\kappa}. \tag{2.2} \]

The following assumption will be needed throughout the paper. This assumption
is similar to the peak power constraint (see the works of Chou and of Linder
[39]). Note that most of the results of this subsection are still valid if $\mu$
satisfies the weaker strong continuity property.

Assumption 2.1 (Compact Supported Density) The probability measure
$\mu$ has a bounded density with respect to the Lebesgue measure on $\mathbb{R}^d$.
Moreover, the support of $\mu$ is equal to its convex hull $\mathcal{G}$, which in turn, is
compact.

The next proposition states the differentiability of the distortion $C$, and
provides an explicit formula for the gradient $\nabla C$ whenever the distortion is
differentiable.

Proposition 2.1 (Pages [33]) Under Assumption 2.1, the distortion $C$ is
continuously differentiable at every $w = (w_1, \ldots, w_{\kappa}) \in \mathcal{D}_*^{\kappa}$. Furthermore,
for all $1 \leq \ell \leq \kappa$,

\[ \nabla_{\ell} C(w) = \int_{W_{\ell}(w)} (w_{\ell} - z) \, d\mu(z). \]

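Some necessary conditions on the location of the minimizers of $C$ can be
derived from its differentiability properties. In particular, Proposition 2.2 below
states that the minimizers of $C$ have parted components and that they are
contained in the support of the density. Thus, the gradient is well defined
and these minimizers are necessarily some zeroes of $\nabla C$. For the sequel it is
convenient to let $\mathring{A}$ be the interior of any subset $A$ of $(\mathbb{R}^d)^{\kappa}$.

Proposition 2.2 (Pages [33]) Under Assumption 2.1, we have

\[ \operatorname{argmin}_{w \in (\mathbb{R}^d)^{\kappa}} C(w) \subset \operatorname{argminloc}_{w \in \mathring{\mathcal{G}}^{\kappa}} C(w) \subset \mathring{\mathcal{G}}^{\kappa} \cap \{\nabla C = 0\} \cap \mathcal{D}_*^{\kappa}, \]

where $\operatorname{argminloc}_{w \in \mathring{\mathcal{G}}^{\kappa}} C(w)$ stands for the set of local minimizers of $C$ over
$\mathring{\mathcal{G}}^{\kappa}$.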

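For any $z \in \mathbb{R}^d$ and $w \in (\mathbb{R}^d)^{\kappa}$, let us define the following vector of $(\mathbb{R}^d)^{\kappa}$

\[ H(z, w) = \left( (w_{\ell} - z) \, \mathbb{1}_{\{z \in W_{\ell}(w)\}} \right)_{1 \leq \ell \leq \kappa}. \tag{2.3} \]

On $\mathcal{D}_*^{\kappa}$, the function $H$ may be interpreted as an observation of the gradient.
With this notation, Proposition 2.1 states that

\[ \nabla C(w) = \int_{\mathcal{G}} H(z, w) \, d\mu(z), \quad w \in \mathcal{D}_*^{\kappa}. \tag{2.4} \]

Let $\complement A$ stand for the complement in $(\mathbb{R}^d)^{\kappa}$ of a subset $A \subset (\mathbb{R}^d)^{\kappa}$.
Clearly, for all $w \in \complement \mathcal{D}_*^{\kappa}$, the mapping $H(\cdot, w)$ is integrable. Therefore,
$\nabla C$ can be extended on $(\mathbb{R}^d)^{\kappa}$ via the formula

\[ h(w) = \int_{\mathcal{G}} H(z, w) \, d\mu(z), \quad w \in (\mathbb{R}^d)^{\kappa}. \tag{2.5} \]

Note however that the function $h$, which is sometimes called the average
function of the algorithm, is not continuous.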
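
The role of $H$ as a noisy observation of the gradient can be sketched numerically. In the Python illustration below, `H` follows (2.3) and `average_H` replaces the integral in (2.5) by a sample mean; both names, the tie-breaking rule (smallest index, a measure-zero event under Assumption 2.1) and the toy data are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest_index(z: Sequence[float], w: Sequence[Sequence[float]]) -> int:
    """Index of the prototype closest to z (ties go to the smallest index)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, w_l)) for w_l in w]
    return dists.index(min(dists))


def H(z: Sequence[float], w: Sequence[Sequence[float]]) -> List[Vector]:
    """H(z, w): w_l - z on the cell containing z, the zero vector elsewhere."""
    l0 = nearest_index(z, w)
    zero = tuple(0.0 for _ in z)
    return [
        tuple(a - b for a, b in zip(w_l, z)) if l == l0 else zero
        for l, w_l in enumerate(w)
    ]


def average_H(samples: Sequence[Sequence[float]], w: Sequence[Sequence[float]]) -> List[Vector]:
    """Sample mean of H(z_i, w), a Monte Carlo stand-in for the integral h(w)."""
    n = len(samples)
    acc = [[0.0] * len(w[0]) for _ in w]
    for z in samples:
        for l, h_l in enumerate(H(z, w)):
            for c, value in enumerate(h_l):
                acc[l][c] += value
    return [tuple(x / n for x in row) for row in acc]


if __name__ == "__main__":
    prototypes = [(0.0,), (10.0,)]
    print(H((1.0,), prototypes))
    print(average_H([(1.0,), (3.0,), (9.0,)], prototypes))
```

Only the component of the winning prototype is non-zero in `H(z, w)`, which is why a single observation moves a single prototype in the CLVQ iterations.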

Remark 2.1 Under Assumption 2.1, a computation for all $w \in \mathcal{D}_*^{\kappa}$ of the
Hessian matrix $\nabla^2 C(w)$ can be deduced from Theorem 4 of Fort and Pages.
In fact, the formula established in this theorem is valid for cost functions
which are more complex than $C$ (they are associated to Kohonen Self
Organizing Maps, see Kohonen [25] for more details). In Theorem 4, letting
$\sigma(k) = \mathbb{1}_{\{k=0\}}$ provides the result for our distortion $C$. The resulting formula
shows that $h$ is singular on $\complement \mathcal{D}_*^{\kappa}$ and, consequently, that this function cannot
be Lipschitz on $\mathcal{G}^{\kappa}$.

2.3 Convergence of the CLVQ algorithm 

The problem of finding a reliable clustering scheme for a dataset is equivalent
to finding optimal (or at least nearly optimal) minimizers for the mapping $C$.
A minimization procedure by a usual gradient descent method cannot be
implemented as long as $\nabla C$ is unknown. Thus, the gradient is approximated
by a single example extracted from the data. This leads to the following
stochastic gradient descent procedure

\[ w(t+1) = w(t) - \varepsilon_{t+1} H(z_{t+1}, w(t)), \quad t \geq 0, \tag{2.6} \]

where $w(0) \in \mathring{\mathcal{G}}^{\kappa} \cap \mathcal{D}_*^{\kappa}$ and $z_1, z_2, \ldots$ are independent observations distributed
according to the probability measure $\mu$.
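
As an illustration, iterations (2.6) can be written in a few lines of Python. This is a minimal sketch: the function name `clvq`, the step schedule $1/(t+1)$ passed in the example and the two-cluster data are hypothetical choices made for the illustration only.

```python
import random
from typing import Callable, Iterable, List, Sequence


def nearest_index(z: Sequence[float], w: Sequence[Sequence[float]]) -> int:
    """Index of the prototype closest to z (ties go to the smallest index)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, w_l)) for w_l in w]
    return dists.index(min(dists))


def clvq(
    samples: Iterable[Sequence[float]],
    w0: Sequence[Sequence[float]],
    step: Callable[[int], float],
) -> List[List[float]]:
    """Run iterations (2.6): only the prototype nearest to z_{t+1} moves."""
    w = [list(w_l) for w_l in w0]
    for t, z in enumerate(samples):
        eps = step(t + 1)  # epsilon_{t+1}
        l0 = nearest_index(z, w)
        # w_{l0} <- w_{l0} - eps * (w_{l0} - z); the other components are unchanged
        w[l0] = [w_c - eps * (w_c - z_c) for w_c, z_c in zip(w[l0], z)]
    return w


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: two well separated clusters on the real line.
    data = [
        (rng.gauss(0.0, 0.5),) if rng.random() < 0.5 else (rng.gauss(5.0, 0.5),)
        for _ in range(5000)
    ]
    print(clvq(data, [(1.0,), (4.0,)], step=lambda t: 1.0 / (t + 1)))
```

Each sample is used once and then discarded, so the memory footprint does not grow with the length of the data stream.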

The algorithm defined by the iterations (2.6) is known as the CLVQ algorithm
in the data analysis community. It is also called the Kohonen Self
Organizing Map algorithm with 0 neighbor (see for instance Kohonen [25])
or the on-line k-means procedure (see MacQueen [30] and Bottou [10]) in
various fields related to statistics. As outlined by Pages in [33], this algorithm
belongs to the class of stochastic gradient descent methods. However,
the almost sure convergence of this type of algorithm cannot be obtained by
general tools such as the Robbins-Monro method (see Robbins and Monro [37])
or the Kushner-Clark Theorem (see Kushner and Clark [26]). Indeed, the
main difficulty essentially arises from the non convexity of the function $C$,
its non coercivity and the singularity of $h$ at $\complement \mathcal{D}_*^{\kappa}$ (we refer the reader to
[33, Section 6] for more details).

The following assumption set is standard in a gradient descent context. It
basically imposes constraints on the decreasing speed of the sequence of steps
$\{\varepsilon_t\}_{t=0}^{\infty}$.

Assumption 2.2 (Decreasing steps) The $(0, 1)$-valued sequence $\{\varepsilon_t\}_{t=0}^{\infty}$
satisfies the following two constraints:

1. $\sum_{t=0}^{\infty} \varepsilon_t = \infty$.

2. $\sum_{t=0}^{\infty} \varepsilon_t^2 < \infty$.
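
A simple sequence satisfying both constraints is $\varepsilon_t = 1/(t+2)$, which is used here purely as an example (the paper does not prescribe a particular schedule). The sketch below prints partial sums to show the two behaviours: the first sum keeps growing like a logarithm, while the second stays below its limit $\pi^2/6 - 1$.

```python
import math
from typing import Tuple


def step(t: int) -> float:
    """Hypothetical step schedule: epsilon_t = 1 / (t + 2), in (0, 1) for t >= 0."""
    return 1.0 / (t + 2)


def partial_sums(n: int) -> Tuple[float, float]:
    """Return (sum of epsilon_t, sum of epsilon_t^2) over t = 0, ..., n - 1."""
    s1 = sum(step(t) for t in range(n))
    s2 = sum(step(t) ** 2 for t in range(n))
    return s1, s2


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, partial_sums(n))
    # Limit of the sums of squares for this schedule.
    print(math.pi ** 2 / 6 - 1)
```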

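An examination of identities (2.6) and (2.3) reveals that if $z_{t+1} \in W_{\ell_0}(w(t))$,
where $\ell_0 \in \{1, \ldots, \kappa\}$, then

\[ w_{\ell_0}(t+1) = (1 - \varepsilon_{t+1}) w_{\ell_0}(t) + \varepsilon_{t+1} z_{t+1}. \tag{2.7} \]

The component $w_{\ell_0}(t+1)$ can be viewed as the image of $w_{\ell_0}(t)$ by a $z_{t+1}$-centered
homothety with ratio $1 - \varepsilon_{t+1}$ (Figure 2 provides an illustration of
this fact). Thus, under Assumptions 2.1 and 2.2, the trajectories of $\{w(t)\}_{t=0}^{\infty}$
stay in $\mathring{\mathcal{G}}^{\kappa} \cap \mathcal{D}_*^{\kappa}$. More precisely, if

\[ w(0) \in \mathring{\mathcal{G}}^{\kappa} \cap \mathcal{D}_*^{\kappa} \]

then

\[ w(t) \in \mathring{\mathcal{G}}^{\kappa} \cap \mathcal{D}_*^{\kappa}, \quad t \geq 0, \quad \text{a.s.} \]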

Although $\nabla C$ is not continuous, some regularity can be obtained. To this
end, we need to introduce the following material. For any $\delta > 0$ and any
compact set $L \subset \mathbb{R}^d$, let the compact set $L_{\delta}^{\kappa} \subset (\mathbb{R}^d)^{\kappa}$ be defined as

\[ L_{\delta}^{\kappa} = \{ w \in L^{\kappa} \mid \min_{k \neq \ell} \|w_{\ell} - w_k\| \geq \delta \}. \tag{2.8} \]

The next lemma, which states the regularity of $\nabla C$, will prove to be
extremely useful in the proof of Theorem 2.2 and throughout Section 4.

Lemma 2.1 (Pages [33]) Assume that $\mu$ satisfies Assumption 2.1 and let
$L$ be a compact set of $\mathbb{R}^d$. Then, there is some constant $P_{\delta}$ such that for all
$w$ and $v$ in $L_{\delta}^{\kappa}$ with $[w, v] \subset \mathcal{D}_*^{\kappa}$,

\[ \|\nabla C(w) - \nabla C(v)\| \leq P_{\delta} \|w - v\|. \]

Figure 2: Drawing of a portion of a 2-dimensional Voronoi tessellation. For
$t \geq 0$, if $z_{t+1} \in W_{\ell_0}(w(t))$ then $w_{\ell}(t+1) = w_{\ell}(t)$ for all $\ell \neq \ell_0$ and $w_{\ell_0}(t+1)$
lies in the segment $[w_{\ell_0}(t), z_{t+1}]$. The update of the vector $w_{\ell_0}(t)$ can also be
viewed as a $z_{t+1}$-centered homothety with ratio $1 - \varepsilon_{t+1}$.

The following lemma, called G-Lemma in [33], is an easy-to-apply convergence
result on stochastic algorithms. It is particularly adapted to the present
situation of the CLVQ algorithm where the average function of the algorithm
$h$ is singular.

Theorem 2.1 (G-Lemma, Fort and Pages [20]) Assume that the iterations
(2.6) of the CLVQ algorithm satisfy the following conditions:

1. $\sum_{t=1}^{\infty} \varepsilon_t = \infty$ and $\varepsilon_t \to 0$ as $t \to \infty$.

2. The sequences $\{w(t)\}_{t=0}^{\infty}$ and $\{h(w(t))\}_{t=0}^{\infty}$ are bounded a.s.

3. The series $\sum_{t=0}^{\infty} \varepsilon_{t+1} \left( H(z_{t+1}, w(t)) - h(w(t)) \right)$ converges a.s. in $(\mathbb{R}^d)^{\kappa}$.

4. There exists a lower semi-continuous function $G : (\mathbb{R}^d)^{\kappa} \to [0, \infty)$
such that

\[ \sum_{t=0}^{\infty} \varepsilon_{t+1} G(w(t)) < \infty, \quad \text{a.s.} \]

Then, there exists a random connected component $\Xi$ of $\{G = 0\}$ such that

\[ \operatorname{dist}(w(t), \Xi) \xrightarrow[t \to \infty]{} 0, \quad \text{a.s.}, \]

where the symbol dist denotes the usual distance function between a vector
and a subset of $(\mathbb{R}^d)^{\kappa}$. Note also that if the connected components of $\{G = 0\}$
are singletons then there exists $\xi \in \{G = 0\}$ such that $w(t) \to \xi$ a.s. as $t \to \infty$.

For a definition of the topological concept of connected component, we refer
the reader to Choquet. The interest of the G-Lemma depends upon the
choice of $G$. In our context, a suitable lower semi-continuous function is $G$
defined by

\[ G(w) = \liminf_{v \in \mathcal{G}^{\kappa} \cap \mathcal{D}_*^{\kappa},\, v \to w} \|\nabla C(v)\|^2, \quad w \in \mathcal{G}^{\kappa}. \tag{2.9} \]

The next theorem is, as far as we know, the first almost sure convergence
theorem for the stochastic algorithm CLVQ.

Theorem 2.2 (Pages [33]) Under Assumptions 2.1 and 2.2, conditioned
on the event $\{ \liminf_{t \to \infty} \operatorname{dist}(w(t), \complement \mathcal{D}_*^{\kappa}) > 0 \}$, one has

\[ \operatorname{dist}(w(t), \Xi_{\infty}) \xrightarrow[t \to \infty]{} 0, \quad \text{a.s.}, \]

where $\Xi_{\infty}$ is some random connected component of $\{\nabla C = 0\}$.

The proof is an application of the above G-Lemma with the mapping $G$
defined by equation (2.9). Theorem 2.2 states that the iterations of the
CLVQ necessarily converge towards some critical points (zeroes of $\nabla C$).
From Proposition 2.2 we deduce that the set of critical points necessarily
contains optimal quantizers. Recall that, with no assumption other than
$w(0) \in \mathring{\mathcal{G}}^{\kappa} \cap \mathcal{D}_*^{\kappa}$, we have already discussed the fact that the components of
$w(t)$ are almost surely parted for all $t \geq 0$. Thus, it is easily seen that the
two following events only differ on a set of zero probability

\[ \{ \liminf_{t \to \infty} \operatorname{dist}(w(t), \complement \mathcal{D}_*^{\kappa}) > 0 \} \]

and

\[ \{ \inf_{t \geq 0} \operatorname{dist}(w(t), \complement \mathcal{D}_*^{\kappa}) > 0 \}. \]

Some results are provided by Pages in [33] for asymptotically stuck components
but, as pointed out by the author, they are less satisfactory.

3 General distributed asynchronous algorithm 
3.1 Model description 

Let $s(t)$ be any $(\mathbb{R}^d)^{\kappa}$-valued vector and consider the following iterations on
a vector $w \in (\mathbb{R}^d)^{\kappa}$

\[ w(t+1) = w(t) + s(t), \quad t \geq 0. \tag{3.1} \]

Here, the model of discrete time described by iterations (3.1) can only be
performed by a single computing entity. Therefore, if the computations of
the vectors $s(t)$ are relatively time consuming, then not many iterations can
be achieved for a given time span. Consequently, a parallelization of this
computing scheme should be investigated. The aim of this section is to discuss
a precise mathematical description of a distributed asynchronous model
for the iterations (3.1). This model for distributed computing was originally
proposed by Tsitsiklis et al. and was revisited in Bertsekas and
Tsitsiklis [6, Section 7.7].

Assume that we have at our disposal a distributed architecture with $M$ computing
entities called processors (or agents, see for instance Bullo et al. [13]). Each
processor is labeled, for simplicity of notation, by a natural number
$i \in \{1, \ldots, M\}$. Throughout the paper, we will add the superscript $i$ on the
variables possessed by the processor $i$. In the model we have in mind, each
processor has a buffer where its current version of the iterated vector is kept,
i.e. a local memory. Thus, for agent $i$ such iterations are represented by the
$(\mathbb{R}^d)^{\kappa}$-valued sequence $\{w^i(t)\}_{t=0}^{\infty}$.

Let $t \geq 0$ denote the current time. For any pair of processors
$(i, j) \in \{1, \ldots, M\}^2$, the value kept by agent $j$ and available for agent $i$ at time $t$ is not
necessarily the most recent one, $w^j(t)$, but more probably an outdated one,
$w^j(\tau^{i,j}(t))$, where the deterministic time instants $\tau^{i,j}(t)$ satisfy $0 \leq \tau^{i,j}(t) \leq t$.
Thus, the difference $t - \tau^{i,j}(t)$ can be seen as a communication delay. This is
a modeling of some aspects of the network: latency and bandwidth finiteness.

We insist on the fact that there is a distinction between "global" and "local" 
time. The time variable referred to above as $t$ corresponds to a global clock.
Such a global clock is needed only for analysis purposes. The processors work 
without knowledge of this global clock. They have access to a local clock or 
to no clock at all. 

The algorithm is initialized at $t = 0$, where each processor $i \in \{1, \ldots, M\}$
has an initial version $w^i(0) \in (\mathbb{R}^d)^{\kappa}$ in its buffer. We define the general
distributed asynchronous algorithm by the following iterations

\[ w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t)\, w^j(\tau^{i,j}(t)) + s^i(t), \quad i \in \{1, \ldots, M\} \text{ and } t \geq 0. \tag{3.2} \]

The model can be interpreted as follows: at time $t \geq 0$, processor $i$ receives
messages from other processors containing $w^j(\tau^{i,j}(t))$. Processor $i$ incorporates
these new vectors by forming a convex combination and incorporates
the vector resulting from its own "local" computations. The coefficients
$a^{i,j}(t)$ are nonnegative numbers which satisfy the constraint

\[ \sum_{j=1}^{M} a^{i,j}(t) = 1, \quad i \in \{1, \ldots, M\} \text{ and } t \geq 0. \tag{3.3} \]

As the combining coefficients $a^{i,j}(t)$ depend on $t$, the network communication
topology is sometimes referred to as time-varying. The sequences $\{\tau^{i,j}(t)\}_{t=0}^{\infty}$
need not be known in advance by any processor. In fact, their knowledge
is not required to execute iterations defined by equation (3.2). Thus, we do
not necessarily have a shared global clock or synchronized local clocks at
the processors.

As for now the descent terms $\{s^i(t)\}_{t=0}^{\infty}$ will be arbitrary $(\mathbb{R}^d)^{\kappa}$-valued
sequences. In Section 4, when we define the Distributed Asynchronous Learning
Vector Quantization (DALVQ), the definition of the descent terms will
be made more explicit.
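
One step of the general distributed asynchronous iterations (3.2) can be sketched as follows. To keep the illustration short, the versions are scalars rather than $(\mathbb{R}^d)^{\kappa}$-valued vectors (the vector case applies the same formula componentwise), and the function name `async_step` and its argument layout are assumptions of this sketch. The global history is kept only to simulate the model; a real processor would merely hold the messages it has received.

```python
from typing import List, Sequence


def async_step(
    history: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    taus: Sequence[Sequence[int]],
    descents: Sequence[float],
) -> List[float]:
    """Compute w^i(t+1) for every processor i from iteration (3.2), scalar case.

    history[s][j] : w^j(s) for s = 0, ..., t
    weights[i][j] : a^{i,j}(t), each row summing to 1
    taus[i][j]    : tau^{i,j}(t), an index into history
    descents[i]   : s^i(t)
    """
    m = len(weights)
    return [
        sum(weights[i][j] * history[taus[i][j]][j] for j in range(m)) + descents[i]
        for i in range(m)
    ]


if __name__ == "__main__":
    history = [[0.0, 4.0], [1.0, 3.0]]   # values at times 0 and 1, so t = 1
    weights = [[0.5, 0.5], [0.0, 1.0]]   # processor 1 ignores processor 0
    taus = [[1, 0], [1, 1]]              # processor 0 sees an outdated w^1(0)
    descents = [0.1, -0.2]
    print(async_step(history, weights, taus, descents))
```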



3.2 The agreement algorithm 

This subsection is devoted to a short survey of the results, found by Blondel et
al. in [8], for a natural simplification of the general distributed asynchronous
algorithm (3.2). This simplification is called agreement algorithm by Blondel
et al. and is defined by

\[ x^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t)\, x^j(\tau^{i,j}(t)), \quad i \in \{1, \ldots, M\} \text{ and } t \geq 0, \tag{3.4} \]

where $x^i(0) \in (\mathbb{R}^d)^{\kappa}$. An observation of these equations reveals that they are
similar to iterations (3.2), the only difference being that all descent terms
equal 0.
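
The behaviour of the agreement algorithm (3.4) can be explored with a small simulation. The sketch below is an illustration under assumptions that are not taken from the paper: scalar values, a directed ring in which each processor averages its own current value with a delayed value of its left neighbour (weights 1/2 and 1/2), and delays drawn at random from a bounded range. The names `simulate_agreement` and `spread` are hypothetical.

```python
import random
from typing import List, Sequence


def simulate_agreement(
    x0: Sequence[float], steps: int, max_delay: int, seed: int = 0
) -> List[List[float]]:
    """Iterate (3.4) on a directed ring with random bounded delays (scalar case)."""
    rng = random.Random(seed)
    m = len(x0)
    history = [list(x0)]  # history[s][j] = x^j(s)
    for t in range(steps):
        new = []
        for i in range(m):
            j = (i - 1) % m  # processor i listens to its left neighbour only
            # The delay t - tau is at most max_delay - 1 (and tau is never negative).
            tau = max(0, t - rng.randrange(max_delay))
            # a^{i,i}(t) = a^{i,j}(t) = 1/2 and tau^{i,i}(t) = t
            new.append(0.5 * history[t][i] + 0.5 * history[tau][j])
        history.append(new)
    return history


def spread(values: Sequence[float]) -> float:
    """Largest disagreement between two processors."""
    return max(values) - min(values)


if __name__ == "__main__":
    h = simulate_agreement([0.0, 5.0, 10.0], steps=1000, max_delay=3)
    for t in (0, 10, 100, 1000):
        print(t, spread(h[t]))
```

In this toy setting the disagreement between processors shrinks as time goes on, and every value remains within the range of the initial values because each update is a convex combination of earlier values.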

In order to analyse the convergence of the agreement algorithm (3.4), Blondel
et al. in [8] define two sets of assumptions that enforce some weak properties
on the communication delays and the network topology. As shown in [8], if
the assumptions contained in one of these two sets hold, then the distributed
versions of the agreement algorithm, namely the $x^i$, reach an asymptotic
consensus. This latter statement means that there exists a vector $x^*$ (independent
of $i$) such that

\[ x^i(t) \xrightarrow[t \to \infty]{} x^*, \quad i \in \{1, \ldots, M\}. \]

Figure 3: Illustration of the time delays introduced in the general distributed
asynchronous algorithm. Here, there are $M = 4$ different processors with
their own computations of the vectors $w^i$, $i \in \{1, 2, 3, 4\}$. Three arbitrary
values of the global time $t$ are represented ($t_1$, $t_2$ and $t_3$), with $\tau^{i,i}(t_k) = t_k$
for all $i \in \{1, 2, 3, 4\}$ and $1 \leq k \leq 3$. The dashed arrows head towards the
versions available at time $t_k$ for an agent $i \in \{1, 2, 3, 4\}$ represented by the
tail of the arrow.

The agreement algorithm (3.4) is essentially driven by the communication
times $\tau^{i,j}(t)$, which are assumed to be deterministic but do not need to be known a
priori by the processors. The following Assumption 3.1 essentially ensures,
in its third statement, that the communication delays $t - \tau^{i,j}(t)$ are bounded.
This assumption prevents some processor from taking into account some
arbitrarily old values computed by other processors. Assumption 3.1 1. is
just a convention: when $a^{i,j}(t) = 0$ the value $\tau^{i,j}(t)$ has no effect on the
update. Assumption 3.1 2. is rather natural because processors have access
to their own most recent value.

Assumption 3.1 (Bounded communication delays)

1. If $a^{i,j}(t) = 0$ then $\tau^{i,j}(t) = t$, $(i, j) \in \{1, \ldots, M\}^2$ and $t \geq 0$.

2. $\tau^{i,i}(t) = t$, $i \in \{1, \ldots, M\}$ and $t \geq 0$.

3. There exists a positive integer $B_1$ such that

\[ t - B_1 \leq \tau^{i,j}(t) \leq t, \quad (i, j) \in \{1, \ldots, M\}^2 \text{ and } t \geq 0. \]

The next Assumption 3.2 states that the value possessed by agent $i$ at time
$t + 1$, namely $x^i(t+1)$, is a weighted average of its own value and the values
that it has just received from other agents.

Assumption 3.2 (Convex combination and threshold) There exists a
positive constant $\alpha > 0$ such that the following three properties hold:

1. $a^{i,i}(t) \geq \alpha$, $i \in \{1, \ldots, M\}$ and $t \geq 0$.

2. $a^{i,j}(t) \in \{0\} \cup [\alpha, 1]$, $(i, j) \in \{1, \ldots, M\}^2$ and $t \geq 0$.

3. $\sum_{j=1}^{M} a^{i,j}(t) = 1$, $i \in \{1, \ldots, M\}$ and $t \geq 0$.

Let us mention one particularly relevant case for the choice of the combining
coefficients $a^{i,j}(t)$. Let $i \in \{1, \ldots, M\}$ and $t \geq 0$, the set

\[ N^i(t) = \{ j \in \{1, \ldots, M\} \mid a^{i,j}(t) \neq 0 \} \]

corresponds to the set of agents whose version is taken into account by
processor $i$ at time $t$. The coefficients can then be chosen as uniform weights
over this set,

\[ a^{i,j}(t) = \begin{cases} 1 / \# N^i(t) & \text{if } j \in N^i(t), \\ 0 & \text{otherwise,} \end{cases} \]

where $\# A$ denotes the cardinal of any finite set $A$. The above definition
of the combining coefficients appears to be relevant for practical implementations
of the model DALVQ introduced in Section 4. For a discussion of
other cases of special interest regarding the choices of the coefficients $a^{i,j}(t)$ we
refer the reader to [8].

The communication patterns, sometimes referred to as the network communication
topology, can be expressed in terms of directed graphs. For a
thorough introduction to graph theory, see Jungnickel.

Definition 3.1 (Communication graph) Let us fix $t \geq 0$, the communication
graph at time $t$, $(V, E(t))$, is defined by

• the set of vertices $V$ is formed by the set of processors $V = \{1, \ldots, M\}$,

• the set of edges $E(t)$ is defined via the relationship

\[ (j, i) \in E(t) \text{ if and only if } a^{i,j}(t) > 0. \]

Assumption 3.3 is a minimal condition required for a consensus among the
processors. More precisely, it states that for any pair of agents
$(i, j) \in \{1, \ldots, M\}^2$ there is a sequence of communications where the values computed
by agent $i$ will influence (directly or indirectly) the future values kept
by agent $j$.

Assumption 3.3 (Graph connectivity) The graph $(V, \cup_{s \geq t} E(s))$
is strongly connected for all $t \geq 0$.
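
The communication graph and the connectivity requirement can be illustrated by the following Python sketch. It builds $E(t)$ from a weight matrix and tests strong connectivity of a union of edge sets. Since Assumption 3.3 involves an infinite union, the code can only check a finite window of time instants; the function names and the 0-based processor labels are assumptions of this sketch.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def edges_at(weights: Sequence[Sequence[float]]) -> Set[Edge]:
    """E(t): (j, i) is an edge if and only if a^{i,j}(t) > 0."""
    m = len(weights)
    return {(j, i) for i in range(m) for j in range(m) if weights[i][j] > 0}


def _reachable(m: int, edges: Iterable[Edge], start: int = 0) -> Set[int]:
    """Vertices reachable from start by a depth-first search."""
    adjacency: List[List[int]] = [[] for _ in range(m)]
    for a, b in edges:
        adjacency[a].append(b)
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_strongly_connected(m: int, edges: Set[Edge]) -> bool:
    """True if every vertex reaches, and is reached from, vertex 0."""
    reversed_edges = {(b, a) for a, b in edges}
    return len(_reachable(m, edges)) == m and len(_reachable(m, reversed_edges)) == m


if __name__ == "__main__":
    # Two hypothetical time instants with M = 3 processors labelled 0, 1, 2.
    e1 = edges_at([[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])  # 1 listens to 0
    e2 = edges_at([[0.5, 0.0, 0.5], [0.0, 1.0, 0.0], [0.0, 0.5, 0.5]])  # 0 listens to 2, 2 to 1
    print(is_strongly_connected(3, e1), is_strongly_connected(3, e2))
    print(is_strongly_connected(3, e1 | e2))
```

In this example neither graph is strongly connected on its own, but their union is, which is the situation that Assumption 3.3 is designed to cover.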

Finally, we define two supplementary assumptions. The combination of one
of the two following assumptions with the three previous ones will ensure the
convergence of the agreement algorithm. As mentioned above, if Assumption
3.3 holds then there is a communication path between any pair of agents.
Assumption 3.4 below expresses the fact that there is a finite upper bound
for the length of such paths.

Assumption 3.4 (Bounded communication intervals) If $i$ communicates
with $j$ an infinite number of times then there is a positive integer $B_2$ such
that

\[ (i, j) \in E(t) \cup E(t+1) \cup \ldots \cup E(t + B_2 - 1), \quad t \geq 0. \]

Assumption 3.5 is a symmetry condition: if agent $i \in \{1, \ldots, M\}$ communicates
with agent $j \in \{1, \ldots, M\}$ then $j$ has communicated or will communicate
with $i$ during the time interval $(t - B_3, t + B_3)$ where $B_3 > 0$.



Assumption 3.5 (Symmetry) There exists some B^ > such that when- 
ever G E{t), there exists some r that satisfies \t 
ij,t)eEir). 



r < B^i and 



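To shorten the notation, we set

$$
(\mathrm{AsY})_1 \equiv
\begin{cases}
\text{Assumption 3.1;}\\
\text{Assumption 3.2;}\\
\text{Assumption 3.3;}\\
\text{Assumption 3.4,}
\end{cases}
\qquad
(\mathrm{AsY})_2 \equiv
\begin{cases}
\text{Assumption 3.1;}\\
\text{Assumption 3.2;}\\
\text{Assumption 3.3;}\\
\text{Assumption 3.5.}
\end{cases}
$$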
We are now in a position to state the main result of this section. Theorem
3.1 expresses the fact that, for the agreement algorithm, a consensus is
asymptotically reached by the agents.

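Theorem 3.1 (Blondel et al. [8]) Under the set of Assumptions $(\mathrm{AsY})_1$
or $(\mathrm{AsY})_2$, there is a consensus vector $x^* \in (\mathbb{R}^d)^\kappa$ (independent of $i$) such
that

$$\lim_{t \to \infty} \left\| x^i(t) - x^* \right\| = 0, \quad i \in \{1, \ldots, M\}.$$

Besides, there exist $\rho \in [0, 1)$ and $L > 0$ such that

$$\left\| x^i(t) - x^i(\tau) \right\| \leq L \rho^{t - \tau}, \quad i \in \{1, \ldots, M\} \text{ and } t \geq \tau \geq 0.$$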

3.3 Asymptotic consensus 

This subsection is devoted to the analysis of the general distributed
asynchronous algorithm (3.2). For this purpose, the study of the agreement
algorithm defined by equations (3.4) will be extremely fruitful. The following
lemma states that the version possessed by agent $i \in \{1, \ldots, M\}$ at time
$t \geq 0$, namely $w^i(t)$, depends linearly on the other initialization vectors
$w^j(0)$ and the descent subsequences $\{s^j(\tau)\}_{\tau=0}^{t-1}$, where $j \in \{1, \ldots, M\}$.

Lemma 3.1 (Tsitsiklis [41]) For all $(i, j) \in \{1, \ldots, M\}^2$ and $t \geq 0$, there
exists a real-valued sequence $\{\phi^{i,j}(t, \tau)\}_{\tau=-1}^{t-1}$ such that

$$w^i(t) = \sum_{j=1}^{M} \phi^{i,j}(t, -1)\, w^j(0) + \sum_{\tau=0}^{t-1} \sum_{j=1}^{M} \phi^{i,j}(t, \tau)\, s^j(\tau).$$

For all $(i, j) \in \{1, \ldots, M\}^2$ and $t \geq 0$, the real-valued sequences
$\{\phi^{i,j}(t, \tau)\}_{\tau=-1}^{t-1}$ do not depend on the value taken by the descent terms
$s^i(t)$. The real numbers $\phi^{i,j}(t, \tau)$ are determined by the delay sequences
$\{\tau^{i,j}(\tau)\}_{\tau=0}^{t}$ and the weight sequences $\{a^{i,j}(\tau)\}_{\tau=0}^{t}$,
which do not depend on $w$. These last sequences are unknown in general, but
some useful qualitative properties can be derived, as expressed in Lemma 3.2
below.

Lemma 3.2 (Tsitsiklis [41]) For all $(i, j) \in \{1, \ldots, M\}^2$, let
$\{\phi^{i,j}(t, \tau)\}_{\tau=-1}^{t-1}$ be the sequences defined in Lemma 3.1.

1. Under Assumption 3.1,

$$0 \leq \phi^{i,j}(t, \tau) \leq 1, \quad (i, j) \in \{1, \ldots, M\}^2 \text{ and } t > \tau \geq -1.$$
2. Under Assumptions $(\mathrm{AsY})_1$ or $(\mathrm{AsY})_2$, we have:

(a) For all $(i, j) \in \{1, \ldots, M\}^2$ and $\tau \geq -1$, the limit of $\phi^{i,j}(t, \tau)$ as $t$
tends to infinity exists and is independent of $i$. It will be denoted
$\phi^j(\tau)$.

(b) There exists some $\eta > 0$ such that

$$\phi^j(\tau) \geq \eta, \quad j \in \{1, \ldots, M\} \text{ and } \tau \geq -1.$$

(c) There exist a constant $A > 0$ and $\rho \in (0, 1)$ such that

$$\left| \phi^{i,j}(t, \tau) - \phi^j(\tau) \right| \leq A \rho^{t - \tau}, \quad (i, j) \in \{1, \ldots, M\}^2 \text{ and } t > \tau \geq -1.$$

Take $t' \geq 0$ and assume that the agents stop performing updates after time $t'$,
but keep communicating and merging the results. This means that $s^j(t) = 0$
for all $t \geq t'$. Then, equations (3.2) write

$$w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t)\, w^j\big(\tau^{i,j}(t)\big), \quad i \in \{1, \ldots, M\} \text{ and } t \geq t'.$$

If Assumptions $(\mathrm{AsY})_1$ or $(\mathrm{AsY})_2$ are satisfied then Theorem 3.1 shows that
there is a consensus vector, depending on the time instant $t'$. This vector
will be equal to $w^*(t')$ defined below (see Figure 4). Lemma 3.2 provides a
good way to define the sequence $\{w^*(t)\}_{t=0}^{\infty}$ as shown in Definition 3.2. Note
that this definition does not involve any assumption on the descent terms.
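The "only averaging" regime can be illustrated with a small simulation. The sketch below runs the agreement iteration under simplifying assumptions that are not required by the theory: zero delays ($\tau^{i,j}(t) = t$), a constant weight matrix, and a doubly stochastic choice of weights so that the consensus is simply the initial mean. All names and numerical values are hypothetical.

```python
def agreement_step(versions, weights):
    """One averaging step: w^i(t+1) = sum_j a^{i,j} w^j(t), with zero delays."""
    num_agents = len(versions)
    dim = len(versions[0])
    return [
        [sum(weights[i][j] * versions[j][k] for j in range(num_agents))
         for k in range(dim)]
        for i in range(num_agents)
    ]


def run_agreement(versions, weights, num_steps):
    """Iterate the averaging step num_steps times and return the final versions."""
    for _ in range(num_steps):
        versions = agreement_step(versions, weights)
    return versions


def spread(versions):
    """Largest coordinate-wise gap between any two versions."""
    return max(abs(a - b) for u in versions for v in versions for a, b in zip(u, v))


# Each agent averages itself with one neighbour on a directed ring.
ring_weights = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
initial = [[0.0, 0.0], [3.0, 6.0], [6.0, 3.0]]
final = run_agreement(initial, ring_weights, 60)
```

The gap between versions shrinks geometrically, in line with the rate $L\rho^{t-\tau}$ of Theorem 3.1; with these particular weights every version approaches the mean $(3, 3)$ of the initial vectors.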

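Definition 3.2 (Agreement vector) Assume that Assumptions $(\mathrm{AsY})_1$
or $(\mathrm{AsY})_2$ are satisfied. The agreement vector sequence $\{w^*(t)\}_{t=0}^{\infty}$ is
defined by

$$w^*(t) \triangleq \sum_{j=1}^{M} \phi^j(-1)\, w^j(0) + \sum_{\tau=0}^{t-1} \sum_{j=1}^{M} \phi^j(\tau)\, s^j(\tau), \quad t \geq 0.$$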

It is noteworthy that the agreement vector sequence $w^*$ satisfies the following
recursion formula

$$w^*(t+1) = w^*(t) + \sum_{j=1}^{M} \phi^j(t)\, s^j(t), \quad t \geq 0. \tag{3.5}$$
[Figure 4 (timeline along a global time reference): before $t'$, averaging and computation with descent terms (general distributed asynchronous algorithm); after $t'$, only averaging (agreement algorithm).]

Figure 4: The agreement vector at time $t'$, $w^*(t')$, corresponds to the common
value asymptotically achieved by all processors if computations integrating
descent terms have stopped after $t'$, i.e., $s^j(t) = 0$ for all $t \geq t'$.

4 Distributed asynchronous learning vector 
quantization 

4.1 Introduction, model presentation 

From now on, and until the end of the paper, we assume that one of the
two sets of assumptions $(\mathrm{AsY})_1$ or $(\mathrm{AsY})_2$ holds, as well as the
compact-supported density Assumption 2.1. In addition, we will also assume that
$0 \in \mathcal{G}$. For the sake of clarity, all the proofs of the main theorems as well as
the lemmas needed for these proofs have been postponed to the Annex at the
end of the paper.

Tsitsiklis in [41], Tsitsiklis et al. in [40] and Bertsekas and Tsitsiklis in [6]
studied distributed asynchronous stochastic gradient optimization algorithms. In
this series of publications, for the distributed minimization of a cost
function $F : (\mathbb{R}^d)^\kappa \to \mathbb{R}$, the authors considered the general distributed
asynchronous algorithm defined by equation (3.2) with specific choices for the
stochastic descent terms $s^i$. Using the notation of Section 3, the algorithm writes

$$w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t)\, w^j\big(\tau^{i,j}(t)\big) + s^i(t), \quad i \in \{1, \ldots, M\} \text{ and } t \geq 0,$$

with stochastic descent terms $s^i(t)$ satisfying

$$\mathbb{E}\left\{ s^i(t) \mid s^j(\tau),\ j \in \{1, \ldots, M\} \text{ and } t > \tau \geq 0 \right\} = -\varepsilon^i_{t+1} \nabla F\left(w^i(t)\right), \quad i \in \{1, \ldots, M\} \text{ and } t \geq 0, \tag{4.1}$$

where $\{\varepsilon^i_t\}_{t=0}^{\infty}$ are decreasing step sequences. The definition of the descent
terms in [41, 40] is more general than the one appearing in equation (4.1).
We refer the reader to Assumptions 3.2 and 3.3 in [40] and Assumption 8.2
in [6] for the precise definition of the descent terms in [6, 40]. As discussed
in Section 2, the CLVQ algorithm is also a stochastic gradient descent
procedure. Unfortunately, the results from Tsitsiklis et al. in [40, 41, 6] do not
apply with our distortion function, $C$, since the authors assume that $F$ is
continuously differentiable and $\nabla F$ is Lipschitz. Therefore, the aim of this
section is to extend the results of Tsitsiklis et al. to the context of vector
quantization and on-line clustering.

We first introduce the Distributed Asynchronous Learning Vector Quantization
(DALVQ) algorithm. To prove its almost sure consistency, we will need
an Asynchronous G-Lemma, which is inspired by the G-Lemma, Theorem
2.1, presented in Section 2. This theorem may be seen as an easy-to-apply
tool for the almost sure consistency of a distributed asynchronous system
where the average function is not necessarily regular. Our approach also sheds
some new light on the convergence of distributed asynchronous stochastic
gradient descent algorithms. Precisely, Proposition 8.1 in [40] claims that
$\liminf_{t \to \infty} \|\nabla F(w^*(t))\| = 0$ while our main Theorem 4.2 below states that
$\lim_{t \to \infty} \|\nabla C(w^*(t))\| = 0$. However, there is a price to pay for this more
precise result with the non-Lipschitz gradient $\nabla C$. Similarly to Pagès [33],
who assumes that the trajectory of the CLVQ algorithm has almost surely
asymptotically parted components (see Theorem 2.2 in Section 2), we will
suppose that the agreement vector sequence has, almost surely, asymptotically
parted component trajectories.

Recall that the goal of the DALVQ is to provide a well designed distributed
algorithm that quickly processes (in terms of wall clock time) very large data
sets to produce accurate quantization. The data sets (or streams of data)
are distributed among several queues sending data to the different processors
of our distributed framework. Thus, in this context the sequence $z^i_1, z^i_2, \ldots$
stands for the data available for processor $i$, where $i \in \{1, \ldots, M\}$. The
random variables

$$z^i_t, \quad i \in \{1, \ldots, M\} \text{ and } t \geq 1,$$

are assumed to be independent and identically distributed according to $\mu$.

In the definition of the CLVQ procedure (2.6), the term $H(z_{t+1}, w(t))$ can be
seen as an observation of the gradient $\nabla C(w(t))$. Therefore, in our DALVQ
algorithm, each processor $i \in \{1, \ldots, M\}$ is able to compute such
observations using its own data $z^i_1, z^i_2, \ldots$. Thus, the DALVQ procedure is defined
by equation (3.2) with the following choice for the descent term $s^i$:

$$s^i(t) = \begin{cases} -\varepsilon^i_{t+1}\, H\left(z^i_{t+1}, w^i(t)\right) & \text{if } t \in T^i;\\ 0 & \text{otherwise;} \end{cases} \tag{4.2}$$

where $\{\varepsilon^i_t\}_{t=0}^{\infty}$ are $(0, 1)$-valued sequences. The sets $T^i$ contain the time
instants where the version $w^i$, kept by processor $i$, is updated with the descent
terms. This fine grain description of the algorithm allows some processors to
be idle for computing descent terms (when $t \notin T^i$). This reflects the fact that
the computing operations might not take the same time for all processors,
which is precisely the core of asynchronous algorithms analysis. Similarly
to time delays and combining coefficients, the sets $T^i$ are supposed to be
deterministic but do not need to be known a priori for the execution of the
algorithm.
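The descent term of a single processor can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes scalar data ($d = 1$) and takes $H(z, w)$ to be the usual CLVQ observation, whose only nonzero component is $w_\ell - z$ for the component $w_\ell$ nearest to $z$. The function names and the representation of $T^i$ as a Python set are assumptions made for the example.

```python
def clvq_observation(z, w):
    """Assumed form of H(z, w): (w_l - z) for the component nearest to z, else 0."""
    nearest = min(range(len(w)), key=lambda l: (w[l] - z) ** 2)
    return [w[l] - z if l == nearest else 0.0 for l in range(len(w))]


def descent_term(t, update_times, step, z, w):
    """s^i(t) = -step * H(z, w) when t is in T^i, and the zero vector otherwise."""
    if t not in update_times:
        # The processor is idle at time t: no descent term is produced.
        return [0.0] * len(w)
    return [-step * h for h in clvq_observation(z, w)]
```

For instance, `descent_term(3, {1, 2}, 0.5, 0.9, [0.0, 1.0])` is the zero vector because $3 \notin T^i$, while at $t = 1$ only the second component, the one nearest to $z = 0.9$, is moved.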

In the DALVQ model, randomness arises from the data $z$. Therefore, it is
natural to let $\{\mathcal{F}_t\}_{t=0}^{\infty}$ be the filtration built on the $\sigma$-algebras

$$\mathcal{F}_t \triangleq \sigma\left(z^i_s,\ i \in \{1, \ldots, M\} \text{ and } t \geq s \geq 0\right), \quad t \geq 0.$$

An easy verification shows that, for all $j \in \{1, \ldots, M\}$ and $t \geq 0$, $w^*(t)$ and
$w^j(t)$ are $\mathcal{F}_t$-measurable random variables.

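For simplicity, the assumption on the decreasing speed of the sequences
$\{\varepsilon^i_t\}_{t=0}^{\infty}$ is strengthened as follows. The notation $a \vee b$ stands for the maximum
of two reals $a$ and $b$.

Assumption 4.1 There exist two real numbers $K_1 > 0$ and $K_2 \geq 1$ such
that

$$\frac{K_1}{t \vee 1} \leq \varepsilon^i_{t+1} \leq \frac{K_2}{t \vee 1}, \quad i \in \{1, \ldots, M\} \text{ and } t \geq 0.$$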
If Assumption 4.1 holds then the sequences $\{\varepsilon^i_t\}_{t=0}^{\infty}$ satisfy the standard
Assumption 2.2 for stochastic optimization algorithms. Note that the choice
of steps proportional to 1/t has been proved to be a satisfactory learning 
rate, theoretically speaking and also for practical implementations (see for 
instance Murata ^Jj and Bottou and LeCun [T2]). 

For practical implementation, the sequences $\{\varepsilon^i_{t+1}\}_{t=0}^{\infty}$ satisfying Assumption
4.1 can be implemented without a global clock, that is, without assuming that
the current value of $t$ is known by the agents. This assumption is satisfied, for
example, by taking the current value of $\varepsilon^i_t$ proportional to $1/n^i_t$, where $n^i_t$ is
the number of times that processor $i$ has performed an update, i.e., the
cardinal of the set $T^i \cap \{0, \ldots, t\}$. For a given processor, if the time span between
consecutive updates is bounded from above and from below, a straightforward
examination shows that the sequence of steps satisfies Assumption 4.1.
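A clock-free step rule of this kind can be sketched in a few lines. The example below is hypothetical: it uses a proportionality constant of 1, guards the counter so that the step is defined before the first update, and checks the two-sided bound of Assumption 4.1 numerically for one regular update pattern, with constants $K_1 = 1$ and $K_2 = 2$ chosen by hand for that pattern.

```python
def local_steps(update_times, horizon):
    """Steps 1 / n_t for t = 0..horizon-1, where n_t = #(T ∩ {0, ..., t})."""
    steps = []
    count = 0
    for t in range(horizon):
        if t in update_times:
            count += 1
        # max(count, 1) keeps the step defined before the first update.
        steps.append(1.0 / max(count, 1))
    return steps


# A processor that updates at every second time instant.
steps = local_steps(set(range(0, 100, 2)), 100)
within_bounds = all(
    1.0 / max(t, 1) <= steps[t] <= 2.0 / max(t, 1) for t in range(100)
)
```

The processor only needs its own update counter, not the global time $t$; the bound holds here because the gap between consecutive updates is constant, which is a special case of the bounded-gap condition stated above.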

Finally, the next assumption is essentially technical in nature. It rules out
time instants where all processors are idle. It basically requires that,
at any time $t \geq 0$, there is at least one processor $j \in \{1, \ldots, M\}$ satisfying
$s^j(t) \neq 0$.

Assumption 4.2 One has $\sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}} \geq 1$ for all $t \geq 0$.

4.2 The asynchronous G-Lemma 

The aim of this subsection is to state a useful theorem similar to Theorem
2.1, but adapted to our asynchronous distributed context. The precise
Definition 3.2 of the agreement vector sequence should not cast aside the intuitive
definition. The reader should keep in mind that the vector $w^*(t)$ is also the
asymptotical consensus if descent terms are zero after time $t$. Consequently,
even if the agreement vector sequence $\{w^*(t)\}_{t=0}^{\infty}$ is adapted to the filtration
$\{\mathcal{F}_t\}_{t=0}^{\infty}$, the vector $w^*(t)$ cannot be accessed by a user at time $t$. Nevertheless,
the agreement vector $w^*(t)$ can be interpreted as a "probabilistic state" of
the whole distributed quantization scheme at time $t$. This explains why the
agreement vector is such a convenient tool for the analysis of the DALVQ
convergence and will be central in our adaptation of the G-Lemma, Theorem 4.1.
Let us remark that equation (3.5) writes, for all $t \geq 0$,

$$
\begin{aligned}
w^*(t+1) &= w^*(t) + \sum_{j=1}^{M} \phi^j(t)\, s^j(t)\\
&= w^*(t) - \sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}\, \phi^j(t)\, \varepsilon^j_{t+1}\, H\left(z^j_{t+1}, w^j(t)\right).
\end{aligned}
$$

We remind the reader that the $[0, 1]$-valued functions $\phi^j$ are defined in Lemma
3.2.

Using the function $h$ defined by identity (2.5) and the fact that the random
variables $w^*(t)$ and $w^j(t)$ are $\mathcal{F}_t$-measurable, it holds that

$$h(w^*(t)) = \mathbb{E}\left\{ H\left(z, w^*(t)\right) \mid \mathcal{F}_t \right\}, \quad t \geq 0,$$

and

$$h(w^j(t)) = \mathbb{E}\left\{ H\left(z, w^j(t)\right) \mid \mathcal{F}_t \right\}, \quad j \in \{1, \ldots, M\} \text{ and } t \geq 0,$$

where $z$ is a random variable of law $\mu$ independent of $\mathcal{F}_t$.
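For all $t \geq 0$, set

$$\varepsilon^*_{t+1} \triangleq \sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}\, \phi^j(t)\, \varepsilon^j_{t+1}. \tag{4.3}$$

Clearly, the real numbers $\varepsilon^*_t$ are nonnegative. Their strict positivity will
be discussed in Proposition 4.1.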

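Set

$$\Delta M_t^{(1)} \triangleq \sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}\, \phi^j(t)\, \varepsilon^j_{t+1} \left( h(w^*(t)) - h(w^j(t)) \right), \quad t \geq 0, \tag{4.4}$$

and

$$\Delta M_t^{(2)} \triangleq \sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}\, \phi^j(t)\, \varepsilon^j_{t+1} \left( h(w^j(t)) - H\left(z^j_{t+1}, w^j(t)\right) \right), \quad t \geq 0. \tag{4.5}$$

Note that $\mathbb{E}\left\{ \Delta M_t^{(2)} \mid \mathcal{F}_t \right\} = 0$ and, consequently, that the random variables
$\Delta M_t^{(2)}$ can be seen as the increments of a martingale with respect to the
filtration $\{\mathcal{F}_t\}_{t=0}^{\infty}$.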
Finally, with this notation, equation (3.5) takes the form

$$w^*(t+1) = w^*(t) - \varepsilon^*_{t+1}\, h(w^*(t)) + \Delta M_t^{(1)} + \Delta M_t^{(2)}, \quad t \geq 0. \tag{4.6}$$

We are now in a position to state our most useful tool, which is similar
in spirit to the G-Lemma, but adapted to the context of distributed
asynchronous stochastic gradient descent algorithms.
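As a sanity check, the decomposition of the agreement-vector recursion into a mean-field term and two perturbation terms can be verified term by term. The shorthand $c_j(t)$ below is introduced only for this check and is not notation from the paper; the derivation assumes the descent terms $s^j(t) = -\mathbb{1}_{\{t \in T^j\}}\,\varepsilon^j_{t+1} H(z^j_{t+1}, w^j(t))$ of the DALVQ.

```latex
\begin{align*}
c_j(t) &:= \mathbb{1}_{\{t \in T^j\}}\,\phi^j(t)\,\varepsilon^j_{t+1},
\qquad \varepsilon^*_{t+1} = \sum_{j=1}^{M} c_j(t), \\
-\varepsilon^*_{t+1}\, h(w^*(t)) + \Delta M_t^{(1)} + \Delta M_t^{(2)}
&= -\sum_{j=1}^{M} c_j(t)\, h(w^*(t))
 + \sum_{j=1}^{M} c_j(t) \bigl( h(w^*(t)) - h(w^j(t)) \bigr) \\
&\quad + \sum_{j=1}^{M} c_j(t) \bigl( h(w^j(t)) - H(z^j_{t+1}, w^j(t)) \bigr) \\
&= -\sum_{j=1}^{M} c_j(t)\, H(z^j_{t+1}, w^j(t))
 = \sum_{j=1}^{M} \phi^j(t)\, s^j(t).
\end{align*}
```

The terms in $h(w^*(t))$ and $h(w^j(t))$ cancel pairwise, so adding $w^*(t)$ to both sides recovers the recursion $w^*(t+1) = w^*(t) + \sum_{j} \phi^j(t) s^j(t)$.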

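Theorem 4.1 (Asynchronous G-Lemma) Assume that $(\mathrm{AsY})_1$ or $(\mathrm{AsY})_2$
and Assumption 2.1 hold and that the following conditions are satisfied:

1. $\sum_{t=0}^{\infty} \varepsilon^*_t = \infty$ and $\varepsilon^*_t \to 0$ as $t \to \infty$.

2. The sequences $\{w^*(t)\}_{t=0}^{\infty}$ and $\{h(w^*(t))\}_{t=0}^{\infty}$ are bounded a.s.

3. The series $\sum_{t=0}^{\infty} \Delta M_t^{(1)}$ and $\sum_{t=0}^{\infty} \Delta M_t^{(2)}$ converge a.s. in $(\mathbb{R}^d)^\kappa$.

4. There exists a lower semi-continuous function $G : (\mathbb{R}^d)^\kappa \to [0, \infty)$
such that

$$\sum_{t=0}^{\infty} \varepsilon^*_{t+1}\, G\left(w^*(t)\right) < \infty, \quad \text{a.s.}$$

Then, there exists a random connected component $\Xi$ of $\{G = 0\}$ such that

$$\operatorname{dist}\left(w^*(t), \Xi\right) \xrightarrow[t \to \infty]{} 0, \quad \text{a.s.}$$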



4.3 Trajectory analysis 

Pagès's proof in [33] of the almost sure convergence of the CLVQ
procedure required a careful examination of the trajectories of the process
$\{w(t)\}_{t=0}^{\infty}$. Thus, in this subsection we investigate similar properties and
introduce the assumptions that will be needed to prove our main convergence
result, Theorem 4.2.

The next Assumption 4.3 ensures that, for each processor, the quantizers
stay in the support of the density.

Assumption 4.3 One has

$$\mathbb{P}\left\{ w^j(t) \in \mathcal{G}^\kappa \right\} = 1, \quad j \in \{1, \ldots, M\} \text{ and } t \geq 0.$$
Firstly, let us mention that since the set $\mathcal{G}^\kappa$ is convex, if Assumption 4.3
holds then

$$\mathbb{P}\left\{ w^*(t) \in \mathcal{G}^\kappa \right\} = 1, \quad t \geq 0.$$

Secondly, note that Assumption 4.3 is not particularly restrictive. This
assumption is satisfied under the condition: for each processor, no descent
term is added while a combining computation is performed. This writes

$$a^{i,j}(t) = \delta_{i,j} \text{ and } \tau^{i,i}(t) = t, \quad (i, j) \in \{1, \ldots, M\}^2 \text{ and } t \in T^i.$$

This requirement makes sense for practical implementations.

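Recall that if $t \notin T^i$, then $s^i(t) = 0$. Thus, equation (3.2) takes the form

$$
\begin{cases}
w^i_\ell(t+1) = w^i_\ell(t) - \varepsilon^i_{t+1}\left(w^i_\ell(t) - z^i_{t+1}\right) = \left(1 - \varepsilon^i_{t+1}\right) w^i_\ell(t) + \varepsilon^i_{t+1}\, z^i_{t+1} & \text{if } t \in T^i \text{ and } w^i_\ell(t) \text{ is the component nearest to } z^i_{t+1},\\[4pt]
w^i_\ell(t+1) = w^i_\ell(t) & \text{if } t \in T^i \text{ and } w^i_\ell(t) \text{ is not the nearest component},\\[4pt]
w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t)\, w^j\big(\tau^{i,j}(t)\big) & \text{otherwise.}
\end{cases}
\tag{4.7}
$$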



Since $\mathcal{G}^\kappa$ is a convex set, it follows easily that if $w^j(0) \in \mathcal{G}^\kappa$, then $w^j(t) \in \mathcal{G}^\kappa$
for all $j \in \{1, \ldots, M\}$ and $t \geq 0$ and, consequently, that Assumption 4.3
holds.
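The invariance argument can be checked on a toy run. The sketch below is a hypothetical two-processor, scalar-data ($d = 1$, $\kappa = 2$) instance in which local updates happen at even times and plain averaging with zero delays at odd times, so that no descent term is added during a combining step. The schedule, initial values, step rule $1/(n+1)$ and uniform data on $[0, 1]$ are all assumptions chosen for illustration.

```python
import random


def local_update(w, z, step):
    """CLVQ move: pull the component nearest to z toward z by a convex combination."""
    nearest = min(range(len(w)), key=lambda l: abs(w[l] - z))
    updated = list(w)
    updated[nearest] = (1.0 - step) * w[nearest] + step * z
    return updated


def merge(versions, weights_row):
    """Averaging move with zero delays: a convex combination of the versions."""
    kappa = len(versions[0])
    return [sum(a * v[l] for a, v in zip(weights_row, versions)) for l in range(kappa)]


def toy_dalvq(num_steps, seed=0):
    """Alternate local updates (even t) and averaging (odd t) for two processors."""
    rng = random.Random(seed)
    versions = [[0.1, 0.9], [0.3, 0.6]]  # initial versions inside [0, 1]
    counts = [0, 0]
    for t in range(num_steps):
        if t % 2 == 0:
            for i in range(2):
                counts[i] += 1
                z = rng.random()  # data drawn uniformly on the support [0, 1]
                versions[i] = local_update(versions[i], z, 1.0 / (counts[i] + 1))
        else:
            versions = [merge(versions, [0.5, 0.5]) for _ in range(2)]
    return versions


final_versions = toy_dalvq(200)
```

Both kinds of move are convex combinations of points of the support, so every component stays in $[0, 1]$ throughout the run, which is the convexity argument used above.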

The next Lemma 4.1 provides a deterministic upper bound on the differences
between the distributed versions and the agreement vector. For any subset
$A$ of $(\mathbb{R}^d)^\kappa$, the notation $\operatorname{diam}(A)$ stands for the usual diameter defined by

$$\operatorname{diam}(A) = \sup_{x, y \in A} \|x - y\|.$$



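Lemma 4.1 Assume $(\mathrm{AsY})_1$ or $(\mathrm{AsY})_2$ holds and that Assumptions 2.1,
4.1 and 4.3 are satisfied then

$$\left\| w^*(t) - w^i(t) \right\| \leq \sqrt{\kappa}\, M \operatorname{diam}(\mathcal{G})\, A K_2\, \theta_t, \quad i \in \{1, \ldots, M\} \text{ and } t \geq 0, \text{ a.s.},$$

where $\theta_t \triangleq \sum_{\tau=-1}^{t-1} \frac{\rho^{t-\tau}}{\tau \vee 1}$, $A$ and $\rho$ are the constants introduced in Lemma
3.2, and $K_2$ is defined in Assumption 4.1.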



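The sequence $\{\theta_t\}_{t=0}^{\infty}$ defined in Lemma 4.1 satisfies

$$\theta_t \xrightarrow[t \to \infty]{} 0 \quad \text{and} \quad \sum_{t=1}^{\infty} \frac{\theta_t}{t} < \infty. \tag{4.8}$$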



We give some calculations justifying the statements at the end of the Annex. 
Thus, under Assumptions 4.1 and 4.3 it follows easily that

$$w^*(t) - w^i(t) \xrightarrow[t \to \infty]{} 0, \quad i \in \{1, \ldots, M\}, \text{ a.s.},$$

and

$$w^i(t) - w^j(t) \xrightarrow[t \to \infty]{} 0, \quad (i, j) \in \{1, \ldots, M\}^2, \text{ a.s.} \tag{4.9}$$

This shows that the trajectories of the distributed versions of the quantizers
asymptotically reach a consensus with probability 1. In other words, if one of
the sequences $\{w^i(t)\}_{t=0}^{\infty}$ converges then they all converge towards the same
value. The rest of the paper is devoted to proving that this common value is
in fact a zero of $\nabla C$, i.e. a critical point.

To prove the result mentioned above, we will need the following assumption,
which basically states that the components of $w^*$ are parted, for every time
$t$ but also asymptotically. This assumption is similar in spirit to the main
requirement of Theorem 2.2.

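Assumption 4.4 One has

1. $\mathbb{P}\left\{ w^*(t) \in \mathcal{D}_*^\kappa \right\} = 1, \quad t \geq 0.$

2. $\mathbb{P}\left\{ \liminf_{t \to \infty} \operatorname{dist}\left(w^*(t), \complement \mathcal{D}_*^\kappa\right) > 0 \right\} = 1.$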



4.4 Consistency of the DALVQ 

In this subsection we state our main theorem on the consistency of the
DALVQ. Its proof is based on the Asynchronous G-Lemma, Theorem 4.1.
The goal of the next proposition is to ensure that the first assumption of
Theorem 4.1 holds.



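Proposition 4.1 Assume $(\mathrm{AsY})_1$ or $(\mathrm{AsY})_2$ holds and that Assumptions
2.1 and 4.2 are satisfied then $\varepsilon^*_t > 0$, $t \geq 0$, $\varepsilon^*_t \xrightarrow[t \to \infty]{} 0$ and $\sum_{t=0}^{\infty} \varepsilon^*_t = \infty$.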



The third condition required in Theorem 4.1 deals with the convergence of
the two series defined by equations (4.4) and (4.5). The next Proposition 4.2
provides sufficient conditions for the almost sure convergence of these series.

Proposition 4.2 Assume (AsY)-^ or (AsY)2 holds and that Assumptions 
\2.1\ l^.ci] and \4-4\ are satisfied then the series 



$\sum_{t=0}^{\infty} \Delta M_t^{(1)}$ and $\sum_{t=0}^{\infty} \Delta M_t^{(2)}$ converge almost surely in $(\mathbb{R}^d)^\kappa$.
The next proposition may be considered as the most important step in the
proof of the convergence of the DALVQ. It establishes the convergence of a
series of the form $\sum_{t=0}^{\infty} \varepsilon^*_{t+1} \left\| \nabla C(w^*(t)) \right\|^2$. The analysis of the convergence of
this type of series is standard for the analysis of stochastic gradient methods
(see for instance Benveniste et al. [5] and Bottou [9]). In our context, we
pursue the fruitful use of the agreement vector sequence, $\{w^*(t)\}_{t=0}^{\infty}$, and its
related "steps", $\{\varepsilon^*_t\}_{t=0}^{\infty}$.

Note that under Assumption 4.4 we have $h(w^*(t)) = \nabla C(w^*(t))$ for all
$t \geq 0$, almost surely, therefore the sequence $\{\nabla C(w^*(t))\}_{t=0}^{\infty}$ below is well
defined.

Proposition 4.3 Assume (AsY)-^ or (AsY)2 holds and that Assumptions 
\2.1\ \Ji..l\ and |^.^| are satisfied then 

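1. $C\left(w^*(t)\right) \xrightarrow[t \to \infty]{} C_\infty$, a.s.,

where $C_\infty$ is a $[0, \infty)$-valued random variable,

2.

$$\sum_{t=0}^{\infty} \varepsilon^*_{t+1} \left\| \nabla C\left(w^*(t)\right) \right\|^2 < \infty, \quad \text{a.s.} \tag{4.10}$$

Remark that from the convergence of the series given by equation (4.10) one
can only deduce that $\liminf_{t \to \infty} \left\| \nabla C\left(w^*(t)\right) \right\| = 0$.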

We are now in a position to state the main theorem of this paper, which
expresses the convergence of the distributed versions towards some zero of
the gradient of the distortion. In addition, the convergence results (4.9)
imply that if a version converges then all the versions converge towards this
value.

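Theorem 4.2 (Asynchronous Theorem) Assume $(\mathrm{AsY})_1$ or $(\mathrm{AsY})_2$ holds
and that Assumptions 2.1, 4.1, 4.2, 4.3 and 4.4 are satisfied then

1. $w^*(t) - w^i(t) \xrightarrow[t \to \infty]{} 0$, $i \in \{1, \ldots, M\}$, a.s.,

2. $w^i(t) - w^j(t) \xrightarrow[t \to \infty]{} 0$, $(i, j) \in \{1, \ldots, M\}^2$, a.s.,

3. $\operatorname{dist}\left(w^*(t), \Xi_\infty\right) \xrightarrow[t \to \infty]{} 0$, a.s.,

4. $\operatorname{dist}\left(w^i(t), \Xi_\infty\right) \xrightarrow[t \to \infty]{} 0$, $i \in \{1, \ldots, M\}$, a.s.,

where $\Xi_\infty$ is some random connected component of the set $\{\nabla C = 0\} \cap \mathcal{G}^\kappa$.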
4.5 Annex

Sketch of the proof of the Asynchronous G-Lemma 4.1. The proof is
an adaptation of the one found by Fort and Pagès, Theorem 4 in [20]. The
recursive equation (4.6) satisfied by the sequence $\{w^*(t)\}_{t=0}^{\infty}$ is similar to the
iterations (2) in [20] (with the notation of that paper):

$$X^{t+1} = X^t - \varepsilon_{t+1}\, h\left(X^t\right) + \varepsilon_{t+1} \left( \Delta M^{t+1} + \eta^{t+1} \right), \quad t \geq 0.$$

Thus, similarly, we define a family of continuous time stepwise function 

{u w (t, 

(0, u) = if M G [et + ... + <, + ... + u G [0, oo). 

and if M < w* (0, u) = w*{0). 

w* {t,u) = w* {0, e* + . . . + e* + u) , t>l and u G [0, oo). 
Hence, for every t G N, 

w*{t,u) =w*{0,t) - h{w*{t,v))dv + Ru{t), uG[0,oo), 
Jo 

where, for every t >1 and n G [£* + ...+ e*^j^^,,e\ + . . . + £:*^_j;_,_]^), 

/e\+...+e*+u t+t' 
w*{0,v)dv+ (AM«+AMf)). 
-?+•••+<+(/ s=t+l 

The only difference between the families of continuous time functions $\{w^*(t, u)\}_{t \geq 1}$
and $\{X^{(t)}\}_{t \geq 1}$ defined in [20] is the remainder term $R_u(t)$. The convergence

$$\sup_{u \in [0, T]} \left\| R_u(t) \right\| \xrightarrow[t \to \infty]{} 0, \quad T > 0,$$

follows easily from the third assumption of Theorem 4.1. The rest of the
proof follows similarly as in Theorem 4 of [20].

□ 

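Proof of Lemma 4.1. For all $i \in \{1, \ldots, M\}$, all $t \geq 0$, and all
$1 \leq \ell \leq \kappa$, we may write

$$
\begin{aligned}
\left\| w^i_\ell(t) - w^*_\ell(t) \right\|
&= \left\| \sum_{j=1}^{M} \left( \phi^{i,j}(t, -1) - \phi^j(-1) \right) w^j_\ell(0) + \sum_{\tau=0}^{t-1} \sum_{j=1}^{M} \left( \phi^{i,j}(t, \tau) - \phi^j(\tau) \right) s^j_\ell(\tau) \right\|\\
&\qquad \text{(by Definition 3.2 and Lemma 3.1)}\\
&\leq \sum_{j=1}^{M} \left| \phi^{i,j}(t, -1) - \phi^j(-1) \right| \left\| w^j_\ell(0) \right\| + \sum_{\tau=0}^{t-1} \sum_{j=1}^{M} \left| \phi^{i,j}(t, \tau) - \phi^j(\tau) \right| \left\| s^j_\ell(\tau) \right\|\\
&\leq A \rho^{t+1} \sum_{j=1}^{M} \left\| w^j_\ell(0) \right\| + A \sum_{\tau=0}^{t-1} \rho^{t-\tau} \sum_{j=1}^{M} \left\| s^j_\ell(\tau) \right\|\\
&\qquad \text{(by Lemma 3.2).}
\end{aligned}
$$

Thus,

$$
\begin{aligned}
\left\| w^i_\ell(t) - w^*_\ell(t) \right\|
&\leq A \rho^{t+1} \sum_{j=1}^{M} \left\| w^j_\ell(0) \right\| + A \sum_{\tau=0}^{t-1} \rho^{t-\tau} \sum_{j=1}^{M} \mathbb{1}_{\{\tau \in T^j\}}\, \varepsilon^j_{\tau+1} \left\| H\left(z^j_{\tau+1}, w^j(\tau)\right)_\ell \right\|\\
&\qquad \text{(by equation (4.2))}\\
&\leq A \rho^{t+1} \sum_{j=1}^{M} \left\| w^j_\ell(0) \right\| + A \sum_{\tau=0}^{t-1} \rho^{t-\tau} \sum_{j=1}^{M} \varepsilon^j_{\tau+1} \left\| w^j_\ell(\tau) - z^j_{\tau+1} \right\|.
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
\left\| w^i_\ell(t) - w^*_\ell(t) \right\|
&\leq A M \operatorname{diam}(\mathcal{G})\, \rho^{t+1} + A \operatorname{diam}(\mathcal{G})\, K_2 M \sum_{\tau=0}^{t-1} \frac{\rho^{t-\tau}}{\tau \vee 1}\\
&\qquad \text{(because } 0 \in \mathcal{G} \text{ and by Assumptions 4.1 and 4.3)}\\
&\leq A \operatorname{diam}(\mathcal{G})\, K_2 M \sum_{\tau=-1}^{t-1} \frac{\rho^{t-\tau}}{\tau \vee 1}.
\end{aligned}
$$

Consequently,

$$\left\| w^*(t) - w^i(t) \right\| \leq \sqrt{\kappa}\, A \operatorname{diam}(\mathcal{G})\, K_2 M \sum_{\tau=-1}^{t-1} \frac{\rho^{t-\tau}}{\tau \vee 1}.$$

This proves the desired result.

□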

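Let us now introduce the following events: for any $\delta > 0$ and $t \geq 0$,

$$A^t_\delta \triangleq \left\{ w^*(\tau) \in \mathcal{G}^\kappa_\delta,\ t \geq \tau \geq 0 \right\}.$$

Recall that $\mathcal{G}^\kappa_\delta$ is a compact subset of $\mathcal{G}^\kappa$ defined by equality (2.8). The
next lemma establishes a detailed analysis of security regions for the parted
components of the sequences $\{w^*(t)\}_{t=0}^{\infty}$ and $\{w^i(t)\}_{t=0}^{\infty}$.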



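Lemma 4.2 Let Assumptions 4.1 and 4.3 hold. Then,

1. There exists an integer $t^1_\delta \geq 1$ such that

$$w^*(t) \in \mathcal{G}^\kappa_\delta \implies \left[ w^*(t), w^*(t+1) \right] \subset \mathcal{G}^\kappa_{\delta/2}, \quad t \geq t^1_\delta.$$

2. There exists an integer $t^2_\delta \geq 1$ such that

$$w^*(t) \in \mathcal{G}^\kappa_\delta \implies \left[ w^*(t), w^i(t) \right] \subset \mathcal{G}^\kappa_{\delta/2}, \quad i \in \{1, \ldots, M\} \text{ and } t \geq t^2_\delta.$$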



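Proof of Lemma 4.2. Proof of statement 1. The proof starts with
the observation that under Assumption 4.3 we have $w^i(t) \in \mathcal{G}^\kappa$, for all
$i \in \{1, \ldots, M\}$ and $t \geq 0$. It follows that, for any $1 \leq \ell \leq \kappa$,

$$\left\| H\left(z^i_{t+1}, w^i(t)\right)_\ell \right\| \leq \left\| z^i_{t+1} - w^i_\ell(t) \right\| \leq \operatorname{diam}(\mathcal{G}).$$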
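Let us now provide an upper bound on the norm of the differences between
two consecutive values of the agreement vector sequence. We may write, for
all $t \geq 0$ and all $1 \leq \ell \leq \kappa$,

$$
\begin{aligned}
\left\| w^*_\ell(t+1) - w^*_\ell(t) \right\|
&= \left\| \sum_{j=1}^{M} \phi^j(t)\, s^j_\ell(t) \right\|\\
&\leq \sum_{j=1}^{M} \phi^j(t) \left\| s^j_\ell(t) \right\|\\
&\leq \sum_{j=1}^{M} \varepsilon^j_{t+1} \left\| H\left(z^j_{t+1}, w^j(t)\right)_\ell \right\|\\
&\qquad \text{(by equation (4.2) and statement 1 of Lemma 3.2)}\\
&\leq \frac{M \operatorname{diam}(\mathcal{G})\, K_2}{t \vee 1}\\
&\qquad \text{(by Assumption 4.1).}
\end{aligned}
\tag{4.11}
$$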



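Take $t \geq \frac{4}{\delta} M \operatorname{diam}(\mathcal{G})\, K_2$ and $1 \leq k \neq \ell \leq \kappa$. Let $\alpha$ be a real number in the
interval $[0, 1]$.

If $w^*(t) \in \mathcal{G}^\kappa_\delta$ then

$$
\begin{aligned}
&\left\| (1 - \alpha) w^*_k(t) + \alpha w^*_k(t+1) - (1 - \alpha) w^*_\ell(t) - \alpha w^*_\ell(t+1) \right\|\\
&\quad = \left\| w^*_k(t) - w^*_\ell(t) + \alpha \left( w^*_k(t+1) - w^*_k(t) \right) + \alpha \left( w^*_\ell(t) - w^*_\ell(t+1) \right) \right\|\\
&\quad \geq \left\| w^*_k(t) - w^*_\ell(t) \right\| - \alpha \left\| w^*_k(t+1) - w^*_k(t) \right\| - \alpha \left\| w^*_\ell(t) - w^*_\ell(t+1) \right\|\\
&\quad \geq \delta - 2\alpha \frac{\delta}{4}\\
&\quad \geq \delta/2.
\end{aligned}
$$

This proves that the whole segment $[w^*(t), w^*(t+1)]$ is contained in $\mathcal{G}^\kappa_{\delta/2}$.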

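Proof of statement 2. Take $t \geq 1$ and $1 \leq \ell \leq \kappa$. If $w^*(t) \in \mathcal{G}^\kappa_\delta$ then by
Lemma 4.1 there exists $t^2_\delta$ such that

$$\left\| w^*_\ell(t) - w^i_\ell(t) \right\| \leq \frac{\delta}{4}, \quad i \in \{1, \ldots, M\} \text{ and } t \geq t^2_\delta.$$

Let $k$ and $\ell$ be two distinct integers between 1 and $\kappa$. For any $t \geq t^2_\delta$ and
any $\alpha \in [0, 1]$,

$$
\begin{aligned}
&\left\| \alpha w^i_k(t) + (1 - \alpha) w^*_k(t) - \alpha w^i_\ell(t) - (1 - \alpha) w^*_\ell(t) \right\|\\
&\quad = \left\| w^*_k(t) - w^*_\ell(t) + \alpha \left( w^i_k(t) - w^*_k(t) \right) + \alpha \left( w^*_\ell(t) - w^i_\ell(t) \right) \right\|\\
&\quad \geq \left\| w^*_k(t) - w^*_\ell(t) \right\| - \alpha \left\| w^i_k(t) - w^*_k(t) \right\| - \alpha \left\| w^*_\ell(t) - w^i_\ell(t) \right\|\\
&\quad \geq \delta - 2\alpha \frac{\delta}{4}\\
&\quad \geq \delta/2.
\end{aligned}
$$

This implies $[w^*(t), w^i(t)] \subset \mathcal{G}^\kappa_{\delta/2}$, as desired.

□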

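Proof of Proposition 4.1. By definition, $\varepsilon^*_{t+1}$ equals $\sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}\, \phi^j(t)\, \varepsilon^j_{t+1}$
for all $t \geq 0$.

On the one hand, since the real number $\phi^j(t)$ belongs to the interval $[\eta, 1]$
(by Lemma 3.2), $\varepsilon^*_{t+1}$ is bounded from above by $\frac{M K_2}{t \vee 1}$, using the right-hand
side inequality of Assumption 4.1.

On the other hand, $\varepsilon^*_{t+1}$ is bounded from below by the nonnegative real
number $\frac{\eta K_1}{t \vee 1} \sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}$, using the left-hand side inequality of Assumption 4.1. Note also
that, as Assumption 4.2 holds, this real number is a positive one. Therefore,
it follows that

$$\varepsilon^*_t \xrightarrow[t \to \infty]{} 0$$

and

$$\sum_{t=0}^{\infty} \varepsilon^*_t = \infty.$$

□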

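Proof of Proposition 4.2. Consistency of $\sum_{t=0}^{\infty} \Delta M_t^{(1)}$. Let $\delta$ be a
positive real number and let $t \geq t^2_\delta$, where $t^2_\delta$ is given by Lemma 4.2. We
may write

$$
\begin{aligned}
\mathbb{1}_{A^t_\delta} \left\| \Delta M_t^{(1)} \right\|
&\leq \mathbb{1}_{A^t_\delta} \sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}\, \phi^j(t)\, \varepsilon^j_{t+1} \left\| h(w^*(t)) - h(w^j(t)) \right\|\\
&= \mathbb{1}_{A^t_\delta} \sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}\, \phi^j(t)\, \varepsilon^j_{t+1} \left\| \nabla C(w^*(t)) - \nabla C(w^j(t)) \right\|\\
&\qquad \text{(using statement 2 of Lemma 4.2 and the fact that } \nabla C = h \text{ on } \mathcal{D}_*^\kappa)\\
&\leq \mathbb{1}_{A^t_\delta}\, P_{\delta/2} \sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}\, \phi^j(t)\, \varepsilon^j_{t+1} \left\| w^*(t) - w^j(t) \right\|\\
&\qquad \text{(by Lemma 2.1)}\\
&\leq \sqrt{\kappa}\, M^2 \operatorname{diam}(\mathcal{G})\, A K_2^2\, P_{\delta/2}\, \frac{\theta_t}{t \vee 1}\\
&\qquad \text{(by Lemma 4.1).}
\end{aligned}
$$

Thus, since $\sum_{t=1}^{\infty} \frac{\theta_t}{t} < \infty$, the series

$$\sum_{t=0}^{\infty} \mathbb{1}_{A^t_\delta} \sum_{j=1}^{M} \mathbb{1}_{\{t \in T^j\}}\, \phi^j(t)\, \varepsilon^j_{t+1} \left\| h(w^*(t)) - h(w^j(t)) \right\|$$

is almost surely convergent. Under Assumption 4.4, we have

$$\mathbb{P}\left\{ \bigcup_{\delta > 0} \bigcap_{t \geq 0} A^t_\delta \right\} = 1.$$

It follows that the series $\sum_{t=0}^{\infty} \Delta M_t^{(1)}$ converges almost surely in $(\mathbb{R}^d)^\kappa$.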

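Consistency of $\sum_{t=0}^{\infty} \Delta M_t^{(2)}$. The sequence of random variables $M_t^{(2)}$
defined, for all $t \geq 0$, by

$$
\begin{aligned}
M_t^{(2)} &\triangleq \sum_{\tau=0}^{t} \Delta M_\tau^{(2)}\\
&= \sum_{\tau=0}^{t} \sum_{j=1}^{M} \mathbb{1}_{\{\tau \in T^j\}}\, \phi^j(\tau)\, \varepsilon^j_{\tau+1} \left( h(w^j(\tau)) - H\left(z^j_{\tau+1}, w^j(\tau)\right) \right)
\end{aligned}
$$

is a vector valued martingale with respect to the filtration $\{\mathcal{F}_t\}_{t=0}^{\infty}$. It turns
out that this martingale has square integrable increments. Precisely,

$$\sum_{t=0}^{\infty} \mathbb{E}\left\{ \left\| \Delta M_t^{(2)} \right\|^2 \mid \mathcal{F}_t \right\} < \infty.$$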



Indeed, for all j G {1, . . . , M} and t > 1, 

Ee{||1( (/^(^^■(r))-/f(zi+,(r),t.^(r)))|f | J-.j 

=1 

< E {||/. (t.^-(r)) - E (zi^i(r),«;^-(r))|f | J".} 



r=l 



T=l 



< 2 E ' E { 1 1 (t.^- (r ) ) 1 1 V 1 1 i7 (z^,^, (r) , t.^' (r) ) | f | } 



T=l 

<4/€diam(e;)2^(4+i) 

r=l 

(using Assumption H? 

< 



We conclude that the series $\sum_{t \geq 1} \Delta M_t^{(2)}$ is almost surely convergent.



□ 



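Proof of Proposition 4.3. Denote by $\langle x, y \rangle$ the canonical inner product of
two vectors $x, y \in \mathbb{R}^d$ and also, with a slight abuse of notation, the canonical
inner product of two vectors $x, y \in (\mathbb{R}^d)^\kappa$. Let $\delta$ be a positive real number.
Take any $t \geq \max\{t^1_\delta, t^2_\delta\}$, where $t^1_\delta$ and $t^2_\delta$ are defined as in Lemma 4.2. One
has,

$$\mathbb{1}_{A^{t+1}_\delta}\, C\left(w^*(t+1)\right) \leq \mathbb{1}_{A^t_\delta}\, C\left(w^*(t+1)\right) \quad \text{(by definition } A^{t+1}_\delta \subset A^t_\delta).$$

Consequently,

$$
\begin{aligned}
\mathbb{1}_{A^{t+1}_\delta}\, C\left(w^*(t+1)\right)
&\leq \mathbb{1}_{A^t_\delta}\, C\left(w^*(t)\right) + \mathbb{1}_{A^t_\delta} \left\langle \nabla C(w^*(t)),\ w^*(t+1) - w^*(t) \right\rangle\\
&\quad + \mathbb{1}_{A^t_\delta} \sup_{z \in [w^*(t), w^*(t+1)]} \left\{ \left\| \nabla C(z) - \nabla C(w^*(t)) \right\| \right\} \left\| w^*(t+1) - w^*(t) \right\|\\
&\leq \mathbb{1}_{A^t_\delta}\, C\left(w^*(t)\right) + \mathbb{1}_{A^t_\delta} \left\langle \nabla C(w^*(t)),\ w^*(t+1) - w^*(t) \right\rangle + P_{\delta/2} \left\| w^*(t+1) - w^*(t) \right\|^2\\
&\qquad \text{(using Lemma 2.1).}
\end{aligned}
$$

The first inequality above holds since the bounded increment formula
is valid by statement 1 of Lemma 4.2. Let us now bound separately the right
hand side members of the second inequality.

Firstly, the next inequality holds by inequality (4.11) provided in the proof
of Lemma 4.2:

$$P_{\delta/2} \left\| w^*(t+1) - w^*(t) \right\|^2 \leq \kappa P_{\delta/2} \left( \frac{K_2 M \operatorname{diam}(\mathcal{G})}{t \vee 1} \right)^2.$$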

Secondly, 

l^.(VC(^^(t)),^*(t + l)-u;^(t)) 



M 

lAi{VC{w%t)),Y,<P'{t)s\t)) 
i=i 

(by equation (13. 5p ) 

M 

lA^5Z(VC(t^^(t)),0^-(t).^(t)) 

M 
Thus,

i^.(vcK(t)),t/7^(t + i)-^*(t)) 

M 

i=i 

M 

M 

<1^. J](VCK(t)),</^-(t).^-(t)) 

M 

+ pC{w*it))-^C{w^{t))\\ ||0^'(t)s^-(t)|| 

i=i 

(using Cauchy-Schwarz inequality). 

Therefore, 

tAl{VC{w\t)),w\t + l)-w\t)) 
M 

i=i ^ 

(by statement 2 of Lemma 14.21) 

M 

<tA^^Y.^VC{w^{t)),<P^{t)s^{t)) 

+ Ps,2Y,\\w\t)-W^{t)\\\\(l>^{t)s^{t)\\ 

i=i 

(using Lemma 12.11) 
M

n 

+ Ps/2AKiKM^ diam(^)2^ 
(using Lemma |4?T] and the upper bound (14. lip ). 



Finally, 



M 

2,. nj2 J- ™//-N2^t 



r K2M di&m(g)\^ 
+ ^Ps/2(^— • (4-12) 



Set 
and 



Q] 4 Ps/2AK'^kM'' diam{gy 



ni ^ kPs/2 iK2M diam{g)y . 
In the sequel, we shall need the following lemma. 

Lemma 4.3 For allt > max{tj,t^}, the quantity Wt below is a nonnegative 

00 

t-l , M 



supermartingale with respect to the filtration {J^t}'^o- 



Wt ^ Ia^C (^*(t)) + vKi E 7 E l{-eT.} II VC (w^ir)) \ 

T=0 j = l 

+ ^ + 1, i>i. 



T=t T=t 
Proof of Lemma 4.3. Indeed, using the upper bound provided by equation
(4.12),



M 



tA^C{w\t)) 
M 



M 



<l^*CK(t)) 



^4 



i=i 

In the last inequality we used the fact that $\phi^j(t) \geq \eta$ (Lemma 3.2) and
$\varepsilon^j_{t+1} \geq \frac{K_1}{t \vee 1}$ (Assumption 4.1).

It is straightforward to verify that we have $W_t - \mathbb{E}\left\{ W_{t+1} \mid \mathcal{F}_t \right\} \geq 0$, which
proves the desired result.

□ 

Proof of Proposition 4.3 (continued). Since $\{W_t\}_{t \geq 1}$ is a nonnegative
supermartingale (by Lemma 4.3), $W_t$ converges almost surely as $t \to \infty$ (see

for instance Durrett [E]). Then, as YlT=t ^ ^ 'nT=t ^ 0, 

we have 

lAtC{w*{t)) > Coo, a.s., (4.13) 

where Coo ^ [0, oo) and, because the origin of the expression is increasing in 
t, the following series converges 

oo 1 

Ei^J7VTEi{-e^^}|l^^(^'(^))ir<^' ^-^^ (4.14) 

r=0 3=1 
Proof of statement 1. Assumption 4.4 means that

$$\mathbb{P}\left\{ \bigcup_{\delta > 0} \bigcap_{t \geq 0} A^t_\delta \right\} = 1.$$

Statement 1 follows easily from the convergence (4.13).

Proof of statement 2. The required convergence (4.10) is proven as follows.

We have 

t 

t M 

< E E </'^»l^eT.}lAj4+i live {w\r))f 

r=0 i = l 

(using equality (14.31) ) 

t ^ M 

r=0 j = l 

(using Assumption 14. 2p 

t M 
r=0 ' ' j = l 

(using Assumption 14.21 and statement 2 of Lemma [4.21 ) 



Thus, 

t 

E<+il^J ||VC(^*(r))||^ 

T=0 

t ^ M 

^ E l^JTVI E l^eT.} II VC (^^-(r^ 

r=0 j=l 

r=0 'J i = l 

(by Lemma [2. ip . 



T) — W [T 



2 
Thus,

t 

E<+il^J ||VC(^^(r))||^ 

T=0 

t ^ M 

r=0 j=l 

+ 2PI,KIkM'A' diam(^)2 ^ TV~fr 
(by Lemma [4. ip . 
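Finally, using the convergence (4.14), one has

$$\sum_{\tau=0}^{\infty} \varepsilon^*_{\tau+1}\, \mathbb{1}_{A^\tau_\delta} \left\| \nabla C\left(w^*(\tau)\right) \right\|^2 < \infty, \quad \text{a.s.},$$

and the conclusion follows from the fact that Assumption 4.4 implies

$$\mathbb{P}\left\{ \bigcup_{\delta > 0} \bigcap_{t \geq 0} A^t_\delta \right\} = 1.$$

□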

Proof of Theorem 4.2. The proof consists in verifying the assumptions of
Theorem 4.1 with the function $G$ defined by equation (2.9).

It has been outlined that Assumption 4.3 implies that $w^*(t)$ lies in the compact
set $\mathcal{G}^\kappa$, almost surely, for all $t \geq 0$. Consequently, in the definition of $G(w^*)$
the liminf symbol can be omitted. For all $z \in \mathcal{G}$ and all $t \geq 0$, we have
$\left\| H(z, w^*(t)) \right\| \leq \sqrt{\kappa} \operatorname{diam}(\mathcal{G})$, almost surely, whereas $\{h(w^*(t))\}_{t=0}^{\infty}$ satisfies

$$h(w^*(t)) = \mathbb{E}\left\{ H\left(z, w^*(t)\right) \mid \mathcal{F}_t \right\}, \quad t \geq 0, \text{ a.s.}$$

Thus, the sequences $\{w^*(t)\}_{t=0}^{\infty}$ and $\{h(w^*(t))\}_{t=0}^{\infty}$ are bounded almost surely.

Proposition 4.1, respectively Proposition 4.2, respectively Proposition 4.3,
shows that the first assumption, respectively the third assumption,
respectively the fourth assumption of Theorem 4.1 holds. This concludes the proof
of the theorem.

□ 
Justification of the statements (4.8). Recall that the definition of $\theta_t$ is
provided in Lemma 4.1. Let us remark that it is sufficient to analyse the
behavior in $t$ of the quantity $\sum_{\tau=1}^{t-1} \frac{\rho^{t-\tau}}{\tau}$.
Let $\varepsilon > 0$; then for all $t \geq \lfloor 1/\varepsilon \rfloor + 1$, we have



t-i 

r=l 



t-T 



P 

r 

U/eJ t-T 

r=l r=[l/£j + l 

-r=l r=Ll/£j+l 
< h + 



1-p 1-p 

(using the fact that p G (0, 1)). 
Consequently, for t sufficiently large we have 



t-i 



< 



T=l ^ 

which proves the first claim. 

The second claim follows the same technique by letting "e = 1/ 
Thus, for t > 1 we have 

et < + 

1-p 1-p 

Finally, for T > 1, it holds 

T t-l / T T 

EEV^T^ E^'-^^^-' + E^i. 

t=l T=l ^ \t=l t=l 

The two partial sums in the above parenthesis have finite limits which prove 
the second statement. 
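The two claims can also be observed numerically. The sketch below evaluates $\theta_t = \sum_{\tau=-1}^{t-1} \rho^{t-\tau}/(\tau \vee 1)$ directly from its definition for the hypothetical value $\rho = 0.5$ and accumulates a tail of the series $\sum_t \theta_t / t$. It is an illustration of the behaviour, not a substitute for the argument above.

```python
def theta(t, rho):
    """theta_t = sum over tau = -1..t-1 of rho^(t - tau) / max(tau, 1)."""
    return sum(rho ** (t - tau) / max(tau, 1) for tau in range(-1, t))


rho = 0.5  # hypothetical contraction rate in (0, 1)
samples = [theta(t, rho) for t in (5, 20, 200)]

# Tail of the series sum_t theta_t / t between t = 1000 and t = 1999.
tail = sum(theta(t, rho) / t for t in range(1000, 2000))
```

For this value of $\rho$ the sampled values of $\theta_t$ decrease with $t$ and the tail of the series is small, which is consistent with $\theta_t \to 0$ and with the summability of $\theta_t / t$.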